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Preface 



The 15th Annual Symposium on Combinatorial Pattern Matching was held in 
Ciragan Palace Hotel, Istanbul, Turkey during July 5-7, 2004. CPM 2004 re- 
peated the success of its predecessors; it even surpassed them in terms of the 
number of invited speakers, the number of submissions, and the number of papers 
accepted and presented at the conference. 

In response to the call for papers, CPM 2004 received a record number of 
79 high-quality submissions. Each submission was reviewed by at least three 
program committee members and the comments were returned to the authors. 
Following an extensive electronic discussion period, the Program Committee 
accepted 36 of the submissions to be presented at the conference. They constitute 
original research contributions in combinatorial pattern matching algorithms and 
data structures, molecular sequence analysis, phylogenetic tree construction, and 
RNA and protein structure analysis and prediction. 

CPM 2004 had five invited speakers. In alphabetical order they were: Evan 
Eichler from the University of Washington, USA, Martin Farach-Colton from 
Rutgers University, USA, Paolo Ferragina from the University of Pisa, Italy, 
Piotr Indyk from MIT, USA, and Gene Myers from the University of California, 
Berkeley, USA. 

It is impossible to organize such a successful program without the help of 
many individuals. We would like to express our appreciation to the authors 
of the submitted papers and to the program committee members and external 
referees, who provided timely and significant reviews. 
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CPM 2004 was locally organized by Bilkent University, Ankara, Turkey. Within 
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Sorting by Reversals in Subquadratic Time* 



Eric Tannier 1 and Marie-France Sagot 1,2 

1 INRIA Rhone- Alpes, Laboratoire de Biometrie et Biologie Evolutive 
Universite Claude Bernard, 69622 Villeurbanne cedex, France 
{Eric . Tannier .Marie-France . Sagot }@inrialpes . f r 
2 King’s College, London, UK 



Abstract. The problem of sorting a signed permutation by reversals 
is inspired by genome rearrangements in computational molecular biol- 
ogy. Given two genomes represented as two signed permutations of the 
same elements ( e.g . orthologous genes), the problem consists in finding 
a most parsimonious scenario of reversals that transforms one genome 
into the other. We propose a method for sorting a signed permutation 
by reversals in time 0(ji\/n log n). The best known algorithms run in 
time 0(n 2 ), the main obstacle to an improvement being a costly opera- 
tion of detection of so-called “safe” reversals. We bypass this detection 
and, using the same data structure as a previous random approximation 
algorithm, we achieve the same subquadratic complexity for finding an 
exact optimal solution. This answers an open question by Ozery-Flato 
and Shamir whether a subquadratic complexity could ever be achieved 
for solving the problem. 



1 Introduction 

The problem of sorting a signed permutation by reversals is inspired by a prob- 
lem of genome rearrangement in computational biology. Genome rearrangements 
such as reversals may change the order of the genes (or of other markers) in a 
genome, and also the direction of transcription. We identify the genes with the 
numbers 1 ,...,n, with a plus or minus sign to indicate such direction. Their 
order will be represented by a signed permutation of {±l,...,±n}, that is a 
permutation ir of {±1, . . . , ±n} such that tt[— i\ = — 7r [ z] , where 7r[?'] denotes 
the i th element in n. In the following, we indicate the sign of an element in a 
permutation only when it is minus. 

The reversal of the interval [i, j ] C [1, n] (* < j) is the signed permutation p = 
1 — j , ...,—(* + 1), j + 1, . . . , n. Note that rrp is the permutation obtained 

from 7r by reversing the order and flipping the signs of the elements in the 
interval. If p \, . . . , pk is a sequence of reversals, we say that it sorts a permutation 
7 r if 7rpi ■ ■ ■ pk = Id (Id is the all positive identity permutation 1, . . . , n). We 
denote by d( 7r) the number of reversals in a minimum size sequence sorting 7r. 

* Work supported by the French program bioinformatique Inter-EPST 2002 “Algo- 
rithms for Modelling and Inference Problems in Molecular Biology”. 



S.C. Sahinalp et al. (Eds.): CPM 2004, LNCS 3109, pp. 1—13, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
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The problem of sorting by reversals has been the subject of an extensive 
literature. For a complete survey, see [2]. The first polynomial algorithm was 
given by Hannenhalli and Pevzner [4], and ran in 0(n 4 ). After many subsequent 
improvements, the currently fastest algorithms are those of Kaplan, Shamir and 
Tarjan [5] running in 0(n 2 ), a linear algorithm for computing d(ir) only (it 
does not give the sequence of reversals) by Bader, Moret and Yan [1], and an 
0(ji\Jn log n) random algorithm by Kaplan and Verbin [6] which gives most 
of the time an optimal sequence of reversals, but fails on some permutations 
with very high probability. A reversal p is said to be safe for a permutation n if 
d(np) = d( 7r) — 1. The bottleneck for all existing exact algorithms is the detection 
of safe reversals. Many techniques were invented to address this problem, but 
none has a better time complexity than linear, immediately implying a quadratic 
complexity for the whole method. In a recent paper [7], Ozery-Flato and Shamir 
compiled and compared the best algorithms, and wrote that: “A central question 
in the study of genome rearrangements is whether one can obtain a subquadratic 
algorithm for sorting by reversals” . 

In this paper, we give a positive answer to Ozery-Flato and Shamir’s question. 
A good knowledge of the Hannenhalli-Pevzner theory and of the data structure 
used by Kaplan- Verbin is assumed. 

In the next section, we briefly describe the usual tools (in particular the 
omnipresent breakpoint graph) for dealing with permutations and reversals. We 
mention some operations for sorting by reversals without giving any details (con- 
cerning, for instance, hurdle detection and clearing for which linear methods 
exist). The aim of this paper is to deal only with the most costly part of the 
method, that is sorting a permutation without hurdles. We introduce a new 
and elegant way of transforming the breakpoint graph of a permutation 7 r by 
applying reversals either on 7r or on its inverse permutation 7r _1 . In section 3, 
we describe the method to optimally sort by reversals. With any classical data 
structure, the time complexity of this algorithm is 0(n 2 ), but even in this case 
it presents a special interest because it bypasses the detection of safe reversals 
which is considered as the most costly operation. Then in the last section, we 
indicate the data structure used to achieve subquadratic time complexity. 

2 Mathematical Tools 

To simplify exposition, we require that the first and last elements of a permu- 
tation remain unchanged. We therefore adopt the usual transformation which 
consists in adding the elements 0 and n + 1 to {1, . . . ,n}, with 7r[0] = 0, and 
7r[n+ 1] = n+ 1. The obtained permutation is called an augmented permutation. 
In this paper, all permutations are augmented, and we omit to mention it from 
now on. The inverse permutation ir^ 1 of ir is the (signed) permutation such that 
7T7r _1 = Id. 

2.1 Breakpoint Graphs for a Permutation and Its Inverse 

The breakpoint graph of a permutation 7r is a graph with vertex set V 

defined as follows: for each integer i in n}, let i a (the arrival) and i d 
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(the departure) be two vertices in V; add to V the two vertices 0 d and (n + l) a . 
Observe that all vertex labels are non negative numbers. For simplicity and to 
avoid having to use absolute values, we may later refer to vertex (—i) x (for x = a 
or d). This will be the same as vertex i x . The edge set E of BG(n) is the union 
of two perfect matchings denoted by i?, the reality edges and D , the dream edges 
(in the literature, reality and dream edges are sometimes called reality and desire 
edges, or, in a more prosaic way, black and gray edges): 

— R contains the edges (7r[z]) d (7r[z + l]) a for all* £ {0, . . . , n}; 

— D contains an edge for all i £ {0, . . . , n}, from i d if 7r _1 [z] is positive, from 
i a if 7t — 1 [z] is negative, to ( i + l) a if 7r _1 [z + 1] is positive, and to (i + l) d if 
7t — 1 \i + 1] is negative. 

The reality edges define the permutation zr (what you have), and the dream 
edges define Id (what you want to have). Reality edges always go from a de- 
parture to an arrival, but in dreams, everything can happen. An example of a 
breakpoint graph for a permutation is given in Figure 1. 

To avoid case checking, in the notation of an edge, the mention of departures 
and arrivals may be omitted. For instance, z(z + 1) is a dream edge, indicating 
nothing as concerns the signs of n~ l [i\ and 7r — 1 [i + 1]. 




Fig. 1 . The breakpoint graph of the permutation 0 — 4 — 6372 — 51 8. The bold 
edges are reality edges, and the thin ones are dream edges. The permutation should 
be read clockwise from 0 to n+1. This circular representation makes the cycles in the 
graph more visible. The edges that cross in the drawing correspond to crossing edges 
according to the definition. 



A reality edge (7r[z]) d (7r[z + 1])“ is oriented if zr[z] and n[i + 1] have opposite 
signs, and unoriented otherwise. A dream edge z(z + 1) is oriented if 7r _1 [z] and 
7r _1 [z + 1] have opposite signs (that is, if the edge joins two departures or two 
arrivals), and unoriented otherwise. In the example of Figure 1, (0,4), (6,3), 
(2, 5) and (5, 1) are oriented reality edges, while (3, 4), (6, 7) are oriented dream 
edges. 

To every dream edge z(z + 1), we associate the interval [|7r _1 [z]|, |7 t _1 [z + 1]|] 
(or [|7r _1 [z + 1]|, |tt _ 1 [*]|] if 1 7r 1 [z]| > \tt 1 [z T 1]|). Two dream edges are said to 
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cross if their associated intervals intersect but one is not contained in the other. 
Only dream edges may cross in a breakpoint graph. 

Dream and reality edges are trivially and uniquely decomposed into cycles 
(the sets of both types of edges are perfect matchings of the vertices). By the 
cycles of a permutation n, we mean the cycles of R U D in BG(tt). We call the 
size of a cycle the number of dream edges it contains (it is half the usual length 
of a cycle). Two cycles are said to cross if two of their edges cross. 

A component C of BG(tt) is an inclusionwise minimal subset of its cycles, 
such that no cycle of C crosses a cycle outside C. A component is said to be 
oriented if it contains a cycle with an oriented edge, and unoriented otherwise. 
A hurdle is a special type of unoriented component. We do not define it more 
precisely, since we deal only with permutations without unoriented components, 
therefore without hurdles. See for example [5] for a complete description of what 
a hurdle is, and how to cope with hurdles when there are some. In the example 
of Figure 1, there is a single oriented component. 

The following operation establishes the correspondence between dream and 
reality in BG(ir) and BG(tt~ 1 ). Let (BG( 7r)) _1 be the graph resulting from 
applying the following transformations to BG(tt): 

— change each vertex label i a into i d and i d into i a whenever n ~ l [i\ is negative; 

— change each vertex label (ir [*])“ into i a , and (7r [?.] ) d into i d \ 

— change dream into reality and reality into dream. 

The result of such a transformation applied to the example of Figure 1 is 
given in Figure 2. 




Fig. 2. A breakpoint graph and its inverse. The correspondence of the cycles in shown. 



Lemma 1. (BG( n)) 1 = BG(ir 1 ). 

Proof. By definition, (BC?(7r)) _1 and BG( 7r _1 ) have the same vertex set. There 
is a reality edge (Ti[i]) d (n[i + l]) a in BG(n) for all i e {0, . . . , n}. In (BG( 7r)) _1 , 
it becomes a dream edge from i d if 7r [z] is positive or from i a if 7r [z] is negative, to 
(■ i + l) d if 7 x[i + 1] is positive or to (* + l) a if 7r[i + 1] is negative. This corresponds 
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exactly to the definition of a dream edge of BG(tt~ 1 ). Furthermore, there is a 
dream edge i(i + 1) in BG(tt) for all i £ {0, . . . , n}. In (BG( 7r)) _1 , it becomes a 
reality edge from (7r _1 [i]) d to (7r _1 [i + 1])“. 

Therefore, (BG( 7r)) _1 and BG( 7r _1 ) have the same sets of dream and reality 
edges. □ 

Observe that as a consequence, a cycle of n is also a cycle of 7r _1 . 

2.2 Sorting Simple Permutations 

A simple permutation is a permutation whose breakpoint graph contains only 
cycles of size 1 and 2 (called 1-cycles and 2-cycles). When dealing with simple 
permutations, we use “cycle” to mean 2-cycle, and “adjacency” to mean 1-cycle. 
Simple permutations are worth studying because of the following result. 

Lemma 2. [f] For any permutation tt, there is a simple permutation tt' such 
that d(ir) = dlrr'), and it is possible to deduce an optimal solution S for sorting 
7 r from any optimal solution S' for sorting tt' . Transformations from tt to it' , 
and from S' to S are achievable in linear time. 

Let 7r be a simple permutation. The number of (2-)cycles is denoted by c(7r). 
By Lemma 1, c(7r) = c(7r _1 ). The following is an easy but useful remark. 

Lemma 3. For any reversal p, p = p~ l , and ifirpi ■ ■ ■ Pk = Id, where pi,...,Pk 
are reversals, then tt -1 = pi ■ ■ ■ pk, tt~ x Pk • • • pi = Id, and p\ ■ ■ ■ pkir = Id. 

Hannenhalli and Pevzner [4] proved the following fundamental result, which 
is the basis of the theory for sorting by reversals. We restrict ourselves to permu- 
tations without unoriented components, because the best algorithms all begin 
with a procedure to clear unoriented components in linear time, and we have 
nothing to add to this procedure. Therefore, we suppose unoriented components 
have already been cleared. 

Lemma 4. [f] If tt is a simple permutation without unoriented components, 
d( tt) = c(tt). 

This means that any optimal solution has to decrease the number of cycles 
by one at each step. The effect of the reversal of an interval [i,j] on a breakpoint 
graph is to delete the two reality edges 7r [z] 7r [z + 1] and 7r[j]7r[j + 1], and to 
replace them by two new reality edges 7r[z](— 7r[j]) and (—7 r[z + 1])7 r[j + 1]. The 
only way to decrease the number of cycles is to apply an inversion on the two 
reality edges of a unique cycle, and to replace them by two edges parallel to the 
two dream edges of the cycle, thus creating two adjacencies. In consequence, any 
reversal of an optimal sequence is associated with a cycle (the one it breaks). 
The set of reversals of any optimal sequence is therefore in bijection with the set 
of cycles of tt. In the sequel, we use the same notation for reversals and cycles 
of tt. For instance, we consider the permutation 7r p, for p a cycle of 7r. This 
means that if the two reality edges of p are tt [z] 7t [z + 1] and tt [ j ] tt [ j + 1], then 
the associated reversal p is the reversal of the interval [i,j\. We see in the next 
section the conditions for a reversal associated with a cycle to effectively create 
two adjacencies. 
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2.3 Contraction and Removal of Cycles 

In what follows, permutations are always considered to be both without unori- 
ented components and simple (the appropriate linear time tranformations are 
described in [3-5,7]). Two edges are said to have the same orientation if they 
are both oriented or both unoriented. 

Lemma 5. In any cycle of a permutation, the two dream edges have the same 
orientation, and the two reality edges have the same orientation. 

Proof. Suppose two dream edges e, / of a same cycle had a different orientation, 
say e joins two departures, and / joins a departure and an arrival. The cycle 
would therefore have three departures and one arrival, which is impossible since 
reality edges always join one departure and one arrival. All the similar and dual 
cases are treated in the same way. □ 

A cycle is said to be contractible if its dream edges are oriented. In the 
example of Figure 1, cycle D is contractible as both (3, 4) and (6, 7) are oriented 
dream edges. 

A cycle is said to be removable if its reality edges are oriented. In the example 
of Figure 1, cycles A and C are removable because, respectively, (0,4) and (5, 1), 
(6,3) and (2,5), are oriented reality edges. Cycle B is neither removable nor 
contractible as none of its edges is oriented. 

It is straightforward from the definitions and from Lemma 1 that a cycle p is 
removable in 7r if and only if it is contractible in 7 r _1 . Indeed, if a reality edge is 
oriented in BG(tt), it becomes an oriented dream edge in BG( 7r) _1 = BG( 7r _1 ), 
and vice-versa. The following lemma is another important result in the theory 
of sorting by reversals. 

Lemma 6. [f] A cycle p of ir is contractible if and only if c(np) = c(7r) — 1. 

Observe that this does not mean that d(np) = d( tt) — 1, because np may have 
unoriented components, and in this lies all the difficulty of sorting by reversals. 
From this and from Lemmas 1 and 3, we deduce immediately: 

Lemma 7. A cycle p of n is removable if and only if c(ptt) = c(tt) — 1. 

Let p be a cycle of 7T, with reality edges (7r[f]) d (7r[i-|-l]) a and (7r[j]) d (7r[j-|-l]) a 
(suppose i < j). Let BG(n)/p be the graph obtained from BG(n) by: 

1. deleting the two reality edges of p, and replacing them by two edges parallel 
to the two dream edges of p\ 

2. changing the vertex label (7r[r]) a into (7r[r]) d and the vertex label (n[r]) d 
into (7r[r]) a for every r £ [i + 1 ,j}. 

We call this operation the contraction of p in BG(tt). 

Lemma 8. [f] If a cycle p of a permutation tt is contractible, then BG(tt)/ p = 
BG{ttp). 
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It is possible to write a similar “dual” lemma for removability. Let BG(ir)\p = 
(BGlir)^ 1 / p)~ x . We call this operation the removal of p in BG(ir). It is clear 
that it consists in deleting the two dream edges of p , replacing them by two 
edges parallel to the two reality edges of p (thus obtaining two adjacencies), and 
changing the labels r of the vertices for every r £ [| 7r [i + 1]|, |7r[j]|]. 

Lemma 9. If a cycle p of a permutation i r is removable, then BG(ir) \ p = 
BG(pir). 

The proof is straightforward from the previous Lemma together with Lem- 
ma 1. Contracting a contractible cycle in a breakpoint graph corresponds there- 
fore exactly to applying the corresponding reversal to the permutation. In a sim- 
ilar way, removing a removable cycle corresponds to applying the corresponding 
reversal to the inverse of the permutation. In other words, contracting a cycle is 
changing the reality to make it closer to your dreams, and removing a cycle is 
changing your dreams to make them closer to the reality. 





Fig. 3. Removal of a cycle in the permutation of Figure 2, and contraction of the same 
cycle in the inverse permutation. 



The following facts about contraction and removal can easily be verified. 

Lemma 10. Let p,p' be two distinct cycles in it. If p is contractible, then p' is 
removable in 7 r if and only if it is removable in irp. Conversely, if p is removable, 
p' is contractible in 7 r if and only if it is contractible in pir. Moreover, if p is 
removable in 7r, then two edges distinct from the edges of p cross in ir if and only 
if they cross in pir. 

Proof. Let {ir\r]) d {i r[r + 1])“ be a reality edge of p' in 7 r. If r is in the interval of 
the reversal p, then ( ir[r + l]) d (7r[r]) a is a reality edge in irp, and both 7r[r] and 
7r[r + 1] changed sign. Otherwise, (7r[r]) d (7r[r + l]) a is still a reality edge in irp, 
and 7r[r] and 7r[r + 1] have same sign. In both cases, the edge preserves the same 
orientation. Consequently, p' is removable in 7 r if and only if it is removable in 
irp. The dual statement is deducible from Lemmas 1 and 9. 

Finally, we prove the last statement: let i(i + 1) and j(j + 1) (suppose 
|7T — 1 [z] | < |7T — 1 [ 7 ] |) be two dream edges of n. Let i'(i! + 1) and j'{j' + 1) be 
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the same dream edges in pn, that is after the deletion of the cycle p in 7r (labels 
of the vertices may have changed). The edge i(i + 1) corresponds to the real- 
ity edge 7r -1 [i]7T -1 [i + 1], and the edge i'(i' + 1) corresponds to the reality edge 
(/07r) — 1 [z / ]( / 07r) — 1 [i' + l] = 7r” 1 p[t , ]7r _1 / o[t , -|-l]. However, the numbers of the labels 
of the vertices do not change when contracting a cycle (see Lemma 8), therefore 
we have that |7r -1 p[i , ]| = |7r _1 [i]| and \i i~ 1 p[i' + 1] | = \ir ~ 1 {i + 1]|. The same 
applies with j. This means that |7r — 1 [z] | < |7r — 1 [j] | < 1 7r — 1 [z -|- 1] | < | zr — 1 [ j + 1] | 
if and only if | (pzr) ~ 1 (i 7 ) I < l(P 7r ) _1 (j')l < + 1)1 < I + 1)1, 

which proves that the dream edges cross in pn if and only if they cross in tt. □ 



3 Constructing the Sequence of Reversals 

We can sort each oriented component of tt separately (again, see [4]). We there- 
fore suppose without loss of generality that there is only one component in 
BG(tt). We call a valid sequence of a permutation tt an ordering of a subset of 
its 2-cycles pi, . . . , pk, such that for all i € {1, ... , k}, p, is a removable cycle of 
Pi - 1 . . . p\TT. In other words, pi,. . . ,Pk is valid if c(pk • • • P\tt) = c(tt) — k. 

A valid sequence is said to be maximal if no cycle of pk • • ■ Pitt is removable. 
It is total if k = c(7r), that is if pk • • • piir = Id. 

The algorithm for sorting a simple permutation is the following: 

1. compute a maximal valid sequence of zr; 

2. increase the size of the sequence by adding some cycles inside it while it is 

not total. 

Observe that the above algorithm constructs a sequence of removable instead 
of contractible cycles. The latter would seem to be a more direct way of doing 
the same thing. It is probably possible to do so by directly contracting instead 
of removing cycles (that is, by sorting tt instead of 7 r _1 ). However, we believe 
the proofs are much easier in this case because the structure of the breakpoint 
graph is much more stable after a removal than after a contraction. 

The first step of the algorithm consists in simply detecting removable cycles 
and removing them while there exist removable cycles in the result. We now 
concentrate on the second step, which consists in, given a maximal valid sequence 
pi, . . . , pk, adding some cycles to it. 

We prove the following result: 

Theorem 1 . If S is a maximal but not a total valid sequence of reversals for 
a permutation tt , there is a nonempty sequence S' of reversals such that S may 
be split into two parts S = Si, S 2 , and Si, S", S 2 is a maximal valid sequence of 
reversals for tt. 

Proof. Let S = pi, . . . , pk be a maximal valid sequence of tt. Let C be the set of 
cycles of the permutation pk ■ ■ ■ pin (it is composed of all the cycles of tt minus 
the cycles in S) . The set C is a union of unoriented components (all reality edges 
are unoriented else there would remain a removable cycle, therefore all have 
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same orientation making all dream edges unoriented). Since there is only one 
component in BG( n), one cycle of S has to cross one cycle of C. Choose such a 
cycle pi in S, such that l is maximum for this property. Let Si = pi,. . . , pi_\ and 
S 2 = pi,...,pk- These will be the way of splitting S in two, as described in the 
theorem. Let n\ = pi-\ ■ ■ ■ pin (this is the permutation obtained by removing 
the cycles of S 1 in n). 

We first prove some lemmas. 

Lemma 11. At least one cycle of each component ofC crossed by pi is removable 
in 7Ti . 

Proof. Since a removal does not change the orientation of the dream edges 
(Lemma 10), and C is unoriented in pk---p\n 1 no cycle of C is contractible 
in 7T or in 7Ti. Let p be a cycle of C intersecting pi in n\. Let 7Ti [ j] 7Ti [j'] ( j < j') 
be a dream edge of p crossing pi. Choose a reality edge ni[i]ni[i + 1] of pi such 
that j < i and j' > i + 1. Since ni[j]ni[j'\ is unoriented, n\[j] and 7Ti[j'] have 
the same sign. Since pi is removable in n lt the reality edge ni[i]ni[i + 1] is 
oriented and ni[i] and n\[i + 1] have opposite signs. Among the reality edges 
(ni[r]) d (ni[r + l]) a for r £ \j,j r ], there is therefore a positive even number of 
oriented ones (there has to be an even number of changes of sign, and at least 
one between i and i + 1). Thus at least one removable cycle distinct from pi has 
to cross p. By the maximality of l , this cycle is in C. □ 

Let S' = /I),..., p' p be a valid sequence of cycles of C in 7Ti , such that C \ 
{p'i ■■■ p' p } contains no removable cycle in p' p . . . p\n\. 

Lemma 12. The cycle pi is removable in p p - ■ ■ p\ n-[ . 

Proof. Let n' = p' p _ 1 ■ ■ ■ p[ni. Let n'[i]n'{i+T\ and n'\j]n'\j+ 1] be the two reality 
edges of p p in n' . Let M be the number chosen among 7r'[i], n’[i+ 1], n’\j], n'[j+ 1] 
with highest absolute value, and m be the number with lowest absolute value. 
Since p' p is removable in 7r', n'[i ] and n'[i + 1] have opposite signs, and n '[j] and 
n'[j + 1] also have opposite signs. Recall that no cycle of C is contractible in n, 
so none is in n ' , from Lemma 10. Then the two dream edges 7r'[i]7r , [j + 1] and 
7r, [j] 7r, [* + l] °f Pp are unoriented. In consequence n'[i\ and n'\j + 1] have the same 
sign, and n' [j\ and n' [i + 1] also have the same sign. Note that n' [j + 1] = n' [i] + 1 
and n'[i + 1] = n'[j\ + 1, so M and m cannot be adjacent in the graph, and 
they have opposite signs. Now remove cycle p p from n' . Removing a cycle is 
deleting the dream edges and replacing them by two edges parallel to the reality 
edges, and changing the labels of the vertices of the graph whose absolute values 
are between |m| and |M|, excluding m and M themselves (see Lemma 9). In 
the obtained graph, both M and m participate in two adjacencies, and they 
have opposite signs. In consequence, there is an odd number of changes of signs 
between M and m, therefore an odd number of oriented reality edges. Since 
oriented reality edges belong to a removable 2-cycle (they only go by pairs), at 
least one of these 2-cycles crosses p' p in n' . Since there is no removable cycle of 
C in p p n' (by hypothesis), and pi is the only cycle outside C that may cross a 
cycle of C, this removable cycle is pi. □ 
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Lemma 13. The sequence Si,S',S 2 is a valid sequence of n. 

Proof. We already know that pi, . . . , pi-i, p[, .. . ,p' p ,pi is a valid sequence, by 
Lemma 12. Let now 7Ti = pi ■ ■ ■ pin , and n 2 = pip' p ■ ■ ■ PiPi-i ■ ■ ■ pin. Let D i be 
the component of BG(tti) induced by the cycles pi+i, ■ ■ ■ , Pk in 7Ti. Since each 
component of a breakpoint graph may be sorted separately, we can say that 
pi+i, ■ ■ ■ , Pk is a total valid sequence for D i. Let now D 2 be the component of 
BG{m) induced by the cycles pi+i,...,pk in ir 2 . A cycle pi is contractible in 
Di if and only if it is contractible in D 2 , and two cycles cross in D i if and only 
if they cross in D 2 (they differ only by some removals, which do not change 
contractibility, nor “crossingness” ) . Then pi + i,...,pk is a total valid sequence 
for D 2 . Finally, pi,..., pi-i,p[, . . . , p' p , pi, . . . , p k is valid in ir. □ 

This concludes the proof of Theorem 1 because we have a valid sequence 
Si, S', S 2 , and S' is non empty (see Lemma 11). □ 

In the example of Figure 1, with the labels of the cycles as indicated in 
Figure 2, the algorithm may work as follows: cycle C is removed, then cycle 
D. The sequence C, D is maximal since A and B are not removable in DCn. 
It is not total, so we must find the last cycle in the ordered sequence {C, D} 
which crosses a cycle in {A, B}. This is cycle C which crosses B (C is thus pi, 
Si = 0 and S 2 = {C, D}). In BG(n) \ {C, D}, only A is removable in n. We 
therefore remove A. This makes B removable and we remove it. There are no 
more removable cycles in BG(n) \ { C , D} (indeed, there are no more cycles), so 
A , B , C,D is a total valid sequence and DCBAn = Id. 

Observe that in the case of this example, by working directly on 7r -1 , it is 
easy to find that D,C,B, A is a total valid sequence. 

4 Complexity 

With a classical data structure, applying a reversal and picking a contractible (or 
removable) cycle is achievable in linear time. As a consequence, any algorithm 
that has to apply a reversal at each step, even bypassing the search for a safe 
reversal, will run in 0(n 2 ). This is the case for our algorithm. In practice, even 
an 0(n 2 ) implementation should be an improvement on the existing algorithms. 
For small permutations, such quadratic implementation is even probably better 
than the use of another data structure. However, it is an important mathematical 
question whether sorting a signed permutation by reversals can be done in a 
theoretical subquadratic time complexity. We therefore use a more sophisticated 
implementation to make it run exactly in 0(ny/nlogn). 

In 2003, Kaplan and Verbin [6] invented a clever data structure which allows 
to pick a contractible reversal and apply it in sublinear time. We use the same 
data structure, just adding some flags in order to be able to perform all the 
additional operations we need. Furthermore, we use the structure to represent 
the permutation ir _1 instead of n. Indeed, we are looking for removable cycles 
in 7 r, that is for contractible ones in 7r -1 . As in [6], we split the permutation into 
blocks of size 0(y/n log n). We assign a flag to each block, turned “on” if the 
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block should be read in the reverse order, changing the sign of the elements. We 
also add to each block a balanced binary tree (such as a splay tree for instance) , 
where the elements of the blocks are stored in the nodes, sorted by increasing 
position of their successor in tt~ 1 . This means that 7r^ 1 [i] is before 7r _1 [j] if 
7r ( 7r_1 W + 1) < 7r ( 7r_1 [j] + !)• Each node of the tree corresponds to a dream 
edge of 7 t — 1 (each 2-cycle is represented by two nodes). Each node keeps the 
information on the orientation of the associated dream edge (that is, on the 
contractibility and therefore also on the removability in n of the cycle in which 
the edge participates), and a “running total” storing the number of oriented 
edges in the subtree. 

Up to this point, we have described exactly the data structure of Kaplan and 
Verbin. We now add some flags and numbers to the nodes of the trees in order 
to perform our own queries. Let C be a subset of the 2-cycles of 7r _1 . At the 
beginning, C is all the 2-cycles in BG(ir). To each node, a new flag is added, 
turned “on” if the corresponding edge is in a cycle of C, and a “running total” 
stores the number of oriented edges that belong to C in the subtree. Figure 4 
gives a representation of the data structure applied to the inverse permutation 
of Figure 2 (right side), split in two blocks. Observe that we do not need to know 
the number of all oriented edges, only of those which are in C. 



0 


7 
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3 



-1 


-6 


-2 


4 





Fig. 4. The data structure for the permutation of Figure 2 (right side) at the beginning 
of the algorithm (all edges are in C), if the permutation is split into two blocks. 



In this way, the following queries are achievable in the same way, and same 
complexity, as in [6]. 

Lemma 14. [6] It is possible to know in constant time whether there is a re- 
movable cycle in n, and to pick one such cycle in time O(logn). 
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With the additional flags, and proceeding exactly in the same way, we have 
that: 

Lemma 15. It is possible to know in constant time whether there is a removable 
cycle of n in C, and if there is any, to pick one such cycle in time O(logn). 

Lemma 16. It is possible to contract a cycle and maintain the data structure 
in time 0(y/n log n). 

Proof. All we have to maintain as information besides what was done in [6] is 
whether an edge is in C. In order to do this, we just need to remove an edge from 
C when it is contracted and then to update the running totals. The number of 
elements of C is thus decreased by one in all the blocks containing the contracted 
cycle, and we calculate the difference between the number of elements in C and 
the former number of contractible cycles in C at each flipped block. □ 

Figure 5 represents the data structure at a later step in the algorithm. As 
described before, the algorithm picks cycles C and D , removes them, and since 
there is no other removable cycle, replaces them. The result of this is shown in 
the figure. 
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Fig. 5. The data structure for the permutation of Figure 2 (right side) at a later step 
of the algorithm, when cycles C and D have been removed, and then replaced (but 
remain removed from C). 



Theorem 2. There exists an algorithm which finds a total sorting sequence in 
time 0(ny/n log n). 

Proof. We first find a maximal valid sequence pi,...,pk- This takes 0{k^/n\og n) 
time. If the sequence is total, we are done. Otherwise, we are left with a set C 
of unremoved cycles. 
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We must then identify pi in the sequence such that there is a removable 
cycle in C in pi-i . . . p\ tt, with l maximum for this property. This can be done 
by putting back (applying the reversals again) one by one the removed cycles 
Pk, ... ,pi, checking at each step (in constant time) if there is a removable cycle 
in C, and stopping as soon as one is found. This is achievable in time 0((k — 
l)\/n log n). The sequence Pk, - ■ ■ ,pi is a sequence of safe contractible reversals 
for 7T _1 , and is fixed once for ever. It will represent the last reversals of the final 
sequence. It is not modified nor looked at anymore in the algorithm, and this 
allows to maintain the complexity. 

We then have to remove, while there are any, removable cycles in C. This leads 
to the sequence Pk+l, ■ ■ ■ , Pk' which we insert after the sequence pi , . . . , pi-i, 
while at the same time deleting it from C in the same way we have already seen. 
If pi , . . . , pi- 1 , pk+ 1 , . . . , pk' , pi, ■ ■ ■ , pk is total, we are done. Otherwise, we start 
again replacing the removed cycles one by one in the reverse order but starting 
this from k' (the sequence pi, . . . ,pk is not touched anymore) . We do so until 
there is a removable cycle in the new C, and start again while C is not empty. 

Each cycle in the sequence is removed once, and replaced at most once in the 
whole procedure. The total complexity is then Ofn^Jn logn). □ 

The remaining bottleneck of the new algorithm is now the complexity at each 
step of applying a reversal to a permutation and of keeping the set of contractible 
or removable cycles. This is what has to be improved in the future if one wishes 
to obtain a lower theoretical complexity for sorting a signed permutation by 
reversals. 
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Abstract. The perfect phylogeny model for haplotype evolution has been suc- 
cessfully applied to haplotype resolution from genotype data. In this study we 
explore the application of the perfect phylogeny model to other problems in the 
design and analysis of genetic studies. We consider a novel type of data, xor- 
genotypes, which distinguish heterozygote from homozygote sites but do not 
identify the homozygote alleles. We show how to resolve xor-genotypes under 
perfect phylogeny model, and study the degrees of freedom in such resolutions. 
Interestingly, given xor-genotypes that produce a single possible resolution, we 
show that the full genotype of at most three individuals suffice in order to de- 
termine all haplotypes across the phylogeny. Our experiments with xor- 
genotyping data indicate that the approach requires a number of individuals 
only slightly larger than full genotyping, at a potentially reduced typing cost. 
We also consider selection of minimum-cost sets of tag SNPs, i.e., polymor- 
phisms whose alleles suffice to recover the haplotype diversity. We show that 
this problem lends itself to divide-and-conquer linear-time solution. Finally, we 
study genotype tags, i.e., genotype calls that suffice to recover the alleles of all 
other SNPs. Since most genetic studies are genotype-based, such tags are more 
relevant in such studies than the haplotype tags. We show that under the perfect 
phylogeny model a SNP subset of haplotype tags, as it is usually defined, tags 
the haplotypes by genotype calls as well. 

Keywords: SNPs, Haplotypes, Perfect Phylogeny, Tag SNPs, Graph Realiza- 
tion. 



1 Introduction 

1.1 Background 

Genetic information in nature is usually stored as a linear sequence, written in a mo- 
lecular DNA alphabet of four letters (nucleotides), A, C, G and T. Higher organisms 
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are diploid, i.e., have two near-identical copies of their genetic material arranged in 
paired molecules called chromosomes, one originating from each parent. Such chro- 
mosomes are homologous, that is, contain essentially the same genes in altered vari- 
ants. Changes between variants comprise mainly of Single Nucleotide Polymorphisms 
(SNPs), i.e., sequence sites where one of two letters may appear [1]. These SNPs are 
numerous and it is estimated that any two homologous human chromosomes sampled 
at random from the population differ on average once in every thousand letters, ac- 
counting thus for a few million such differences along the entire genome. The variants 
of a SNP are called alleles. An individual is said to be homozygous for a SNP if both 
homologous chromosomes bear the same allele for this SNP and heterozygous other- 
wise. The sequence of alleles along a chromosome is called a haplotype. At first ap- 
proximation a chromosome can be considered as a patchwork of haplotypes along its 
length. A genotype along homologous chromosomes lists the conflated (unordered 
pair of) alleles for each SNP (see Fig. 1). 



AGTG AC 


b) 


1 2 3 4 5 6 


c) 


AATG AC 




A G/A T G T/A C 




AATGTC 






AGTGTC 









Fig. 1. An example of 6 SNPs along two homologous chromosomes of an individual, (a) This 
individual’s haplotypes. (b) This individual's genotype. Here the Xor-genotype (set of het- 
erozygous SNPs) would be {2,5}. (c) Another potential haplotype pair giving rise to the same 
genotype. 

Both genotype and haplotype data are used in genetic studies. Haplotypes are often 
more informative [6]. Unfortunately, current experimental methods for haplotype 
determination are technically complicated and cost prohibitive. In contrast, a variety 
of current technologies offer practical tools for genotype determination [26]. Given 
genotype data, the haplotypes must be inferred computationally, in a process called 
resolving, phasing or haplotyping [7,8,9,10,11,27], A single genotype may be re- 
solved by different, equally-plausible haplotype pairs (see Fig. 1), but the joint infer- 
ence of a set of genotypes may favor one haplotype pair over the others for each indi- 
vidual. Such inference is usually based on a model for the data. Informally, most 
models rely on the observed phenomenon that over relatively short genomic regions, 
different human genotypes tend to share the same small set of haplotypes [2,3]. 



1.2 The Perfect Phylogeny Model 

During sexual reproduction, only one homologous copy of each chromosome is trans- 
mitted to the offspring. Moreover, that copy has alternating segments from the two 
homologous chromosomes of the parent, due to a segmental exchange process called 
imeiotic) recombination. Studies have shown that recombination occurs mainly in 
narrow regions called hotspots. The genomic segments between hotspots are called 
blocks. They show essentially no recombination [4] and their haplotype diversity is 
limited [2,3]. Within blocks, haplotypes evolve by mutations, i.e., replacement of one 
nucleotide by another at particular sites (other mutation types are not discussed here). 
Since mutations are relatively rare [5], it is often assumed, as we do here, that at most 
one mutation occurs in each site. The perfect phylogeny model for haplotype block 
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evolution assumes that all haplotypes in the population have a common ancestor, no 
recombination and no recurrent mutation. 

The Perfect Phytogeny Haplotyping problem (PPH) seeks to infer haplotypes that 
satisfy the perfect phylogeny model (we defer formal definitions to Section 1.5). PPH 
was first introduced by Gusfield [9], who presented an almost linear solution by re- 
ducing PPH to the classical Graph Realization problem. Simpler, direct solutions 
were later given [10,1 1], which take 0(nm 2 ) for n haplotypes and m SNPs. 



1.3 Informative SNPs 

Many medical genetics studies first determine the haplotypes for a set of individuals 
and then use these results to efficiently type a larger population. Having identified the 
restricted set of possible haplotypes for a region, the identity of a subset of the SNPs 
in the region may suffice to determine the complete haplotype of an individual. Such 
SNPs are called tag SNPs, and typing them alone would lose no information on the 
haplotypes. More generally, we may be interested only in a subset S of all SNPs (e.g., 
coding and regulatory SNPs only) but can use all available SNPs to tag them. In this 
setting we call S the set of interesting SNPs, and seek a smallest possible informative 
SNP set, i.e., is a subset of all SNPs that captures all the information on S (see Fig. 2). 
Hence, the identification of few informative SNPs may lead to substantial saving in 
typing costs. For this reason, the computational problems of finding a minimal tag (or 
informative) set have been studied [2,13,18]. 



o. 

>. 

B 

Q. 
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I 

Fig. 2. Tag SNPs and informative SNPs. The set {1,2} is a tag SNP set. If {9,10,11} is the 
interesting SNP set, then the interesting set distinguishes the haplotypes 1. 2 and {3,4}, but 
does not distinguish between haplotypes 3 and 4. Therefore {1,2} and {6,8} are both informa- 
tive SNPs sets but {4,5} and {2,3} are not. Notice that the same genotype A/C T/A is obtained 
for the tag SNP set {1,2} from the two pairs of haplotypes {1,2} and { 3,4 } . 




Finding the minimum set of tag SNPs within an unconstrained block is NP-hard 
[12], When the perfect phylogeny model is assumed, in the special case of a single 
interesting SNP, a minimal set of informative SNPs was shown to be detectable in 
0{nni) time, for n haplotypes and m SNPs [13], 



1.4 Contribution of This Work 

We study here several problems arising under the perfect phylogeny model during 
genetic analysis of a region, along the process from haplotype determination toward 
their utilization in a genetic study. Our analysis focuses on a single block. 
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Some experimental methods such as DHPLC [14] can determine whether an indi- 
vidual is homozygous or heterozygous for each SNP, but cannot distinguish between 
the two homozygous sites. Typing SNPs in such manner will provide, for each indi- 
vidual, a list of the heterozygous sites, which we refer to as the individual’s xor- 
genotype. Xor-genotypes are less informative than regular ("full") genotypes; but their 
generation may be less costly. Therefore, it is of interest to infer the haplotypes based 
primarily on xor-genotypes instead of full genotypes. In Section 2 we introduce the 
Xor P effect Phytogeny Haplotyping problem (XPPH), study the limitations of using 
only xor-genotypes, and the additional genotype information required. Section 2.2 
presents an efficient solution to XPPH based on the graph realization problem [15]. 
We implemented our solution and evaluated the XPPH strategy in Section 2.3. Our 
tests show that the method compares favorably with standard genotyping. 

Section 3 studies informative SNPs under the perfect phylogeny model. We gener- 
alize the minimum informative set (and tag set) problems by introducing a cost func- 
tion for SNPs, and seek minimum cost sets. The cost is motivated by the facts that 
typing some SNPs may be technically harder (e.g., those in repetitive or high GC 
regions), and that some SNPs are more attractive for direct typing (e.g., protein- 
coding SNPs, due to prior assumptions on their functionality). In section 3.2 we find 
minimal cost informative SNP sets in O(m) time for any number of interesting SNPs, 
when the perfect phylogeny tree is given. This generalizes the result of [13]. Section 
3.3 discusses a practical variant of the tag SNPs set, i.e., the phasing tag SNPs set: As 
we usually have only genotypic (conflated) information on the SNPs, a practical goal 
would be to find a set of SNPs that give the same information as tag SNPs, but instead 
of knowing their haplotype we only know their genotype. We prove that the set of 
informative SNPs is guaranteed to have this quality, and that this is guaranteed only 
under the perfect phylogeny model. 

We conclude with a discussion in Section 4. Throughout the manuscript, many 
proofs are omitted, due to lack of space. 



1.5 Preliminaries 

We denote the two alleles for each SNP by 0 and 1. A haplotype is represented by a 
binary vector. A set of haplotypes is represented by a binary matrix H, where each 
row is a haplotype vector and each column is the vector of SNP alleles. We denote the 
allele of haplotype i for SNP j by H or by h - for the haplotype h=H j . A genotype is 
the conflation (mixing) of two haplotypes. For example, the pair of haplotypes 00100 
and 10001 gives rise to the genotype [0,1 } [0,0] [0,1 } [0,0] [0,1 }. 

The perfect phylogeny model is formalized as follows: Let H be a binary matrix 
of n distinct haplotypes over m SNPs. A perfect phytogeny for H is a pair (T,f) where 
T=(V,E) is a tree with [1 ,...,n}c.V and /:{ l,...,m}— >E is an assignment of SNPs to 
edges such that (1) every edge of T is labeled at least once and (2) for any two rows k, 
/, H k rfH j- iff the edge f(j) lies on the unique path from node k to node l in T. The 
analysis of this model is heavily based on a fundamental property (cf. [21]): 

Theorem 1: There is a perfect phylogeny for H iff H does not contain a 4x2 subma- 
trix in which all four rows are different. Such a submatrix is called a four-gamete. 
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2 Xor Haplotyping 

In this section we formally define the problem of Xor Perfect Phylogeny Haplotyping, 
provide an algorithm for the problem and discuss how to handle ambiguity in the data. 
We then show how to obtain the actual haplotypes efficiently using a small amount of 
additional full genotypes. 



2.1 Problem Definition 

Definition: A xor-genotype of a haplotype pair { h,h ' } is the set of their heterozygote 
SNPs, i.e., {s\h^h’ s } (see Fig. 1). A set of haplotypes H explains a set of xor- 
genotypes X if each member of X is a xor-genotype of a pair of haplotypes in H. 

Problem 1: Xor Perfect Phylogeny Haplotyping (XPPH) 

Input: A set X=[X I X n ] of xor-genotypes over SNPs S={s 1 ,...,s }, such that 

XjU...uX (| =5. 

Goal: Find a haplotype matrix H with a perfect phylogeny (7’./), such that H explains 
X, or determine that no solution exists. (See Fig. 3) 

Hereafter, we shall omit the term “or determine that no solution exists” from prob- 
lem statements for brevity. This requirement is part of the goal of all the algorithms in 
this study. 
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Fig. 3. An example of a solution to XPPH. (a) X -The four xor-genotypes. (b) H - The inferred 
haplotypes. (c) (Tf) - A perfect phylogeny for H. Notice that H explains X by taking the haplo- 
type pairs { 1,4), {3, 4), {2, 3} and {1,3}. Note that T includes a haplotype (0000) that is not in H. 



2.2 An Algorithm for XPPH 
2.2.1 Inference up to Bit Flipping 

A first step in understanding XPPH is the observation that the solution is never 
unique. Rather, flipping the alleles of SNPs in a solution yields yet another solution, 
as we now explain. 

Definition: Two haplotype matrices H nxm and H’ nxm are equivalent up to bit flipping 
(denoted //<->//') if for any two rows k , /, H k j^H ^ <=> H’ k j£H’ ,.. H<tAH’ iff one can be 
obtained from the other by exchanging the roles of 1 and 0 for some columns. Notice 
that » is a set-theoretic equivalence relation. 
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Observation 1: If I !<->!!' then X can be explained by H iff it can be explained by H’. 

Observation 1 implies that XPPH can only be solved up to bit flipping based on the 
data given by X. 

In some cases, however, there may be several alternative sets of haplotypes that 
explain X and are not ^-equivalent. In that case, we will not be able to determine 
which of those sets is the one that really gave rise to X. Our only achievable goal is 
therefore to identify when the solution obtained is guaranteed to be the correct one. 
We will next show that this is guaranteed by the uniqueness of the solution. The 
analysis relies on the following property of perfect phylogenies: 

Key property: Let (Tf) be a perfect phylogeny for H. If // =() then for all k, H k j= 0 iff 
nodes i and k are in the same component of T\f(j). 

Definition: (Tf is a perfect phylogeny for X if (Tf is a perfect phylogeny for some 
H that explains X. 

Proposition 1: When X has a unique perfect phylogeny then the haplotype matrix that 
explains it is unique up to bit flipping (i.e., up to ^-equivalence). 

Proof: It suffices to prove that if ( Tf) is a perfect phylogeny for H then there is no H’ 
such that ( Tf) is a perfect phylogeny for H’ and —i(H<r J >H’). First we observe that 
there is a unique correspondence between the nodes of T and the haplotypes in H. 
This correspondence is obtained as follows. We first identify the haplotype heH of an 
arbitrary leaf v. This is done by observing the SNPs that correspond to the edge 
incident on v. h is the only haplotype that is distinct from all others in these SNPs. 
The haplotypes of all other nodes are now readily determined by the key property. 
This generates the uniuqe correspondence. The actual haplotypes are set by fixing 
arbitrarily the haplotype vector of one node and setting all others according to the key 
property. □ 

Proposition 1 motivates a new formulation of Problem 1 ’ : 

Problem 1': XPPH: 

Input: A set X={X l ,...,X n ] of xor-genotypes over SNPs S={s l ,...,s m ], such that 

XjU...uX (| =5. 

Goal: Find a unique perfect phylogeny ( Tf) for X, or determine that there are multiple 
perfect phylogenies for X. 

Proposition 1 implies that a solution to Problem V (if unique) is guaranteed to be a 
perfect phylogeny for the correct set of haplotypes, i.e., the haplotypes that actually 
gave rise to X. 

Gusfield's work [9] leads to an interesting and useful connection between xor- 
genotypes and paths along the perfect phylogeny tree, as follows. 

Definition: We say that a pair (Tf) realizes X0 S if X t is the union of edge labels that 
constitute a path in T. ( Tf) is said to realize a collection X of subsets if each X ; e X is 
realized by (Tf). 

Proposition 2: ( Tf) is a perfect phylogeny for X iff X is realized by (Tf). 
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The following formulation for XPPH is finally obtained: 

Problem 1": XPPH: 

Input: A set X={X v ...,X n } of xor-genotypes over SNPs .S = { .v p . . . ,s } , such that 
XjU.-.uX,, =5. 

Goal: Find the unique realization (T x f) for X, or determine that there are multiple 
realizations for X. 

Intuitively, the larger the number of typed SNPs, the greater the chances to have a 
unique realization. Occasionally, however, a dataset X may have multiple realizations 
even with many SNPS.. This is the case of the data including xor-equivalent SNPs: 

Definition: We say that s,s'eS are xor-equivalent w.r.t. X and write s~ x s' if for all i: 
se X^s'e X r 

Fortunately, xor-equivalent SNPs may be redundant. This is the case of the data in- 
cluding haplotype-equivalent SNPs: 

Definition: We say that s,s'eS are haplotype-equivalent w.r.t. H and write .v= H .v' if for 
all i,i: H All <=>//. ,aH ,. Note that ~ H and ~ x are equivalence relations. 

Observation 3: Haplotype-equivalence implies xor-equivalence but not vice versa. 
(See Fig 4). 
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Fig. 4. Xor-equivalence and haplotype-equivalence. (a) X - The xor-genotypes. (b) H - The 
haplotypes matrix. Haplotypes 1 and 4 form the first xor-genotype, and haplotypes 2 and 3 
form the second. The pairs of xor-equivalent SNPs are { 1, 4} and {3, 5), while only 3 and 5 are 
haplotype-equivalent. (c) (Tj) - A realization for X that is a perfect phylogeny for H. (d) (T’f) 
- Another realization for X that is not a perfect phylogeny for H. 
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We next show that haplotype-equivalent SNPs are redundant. 

Notation: Denote by S H c:S the set that is obtained by taking one representative from 
each haplotype-equivalence class. Denote by H H the haplotype matrix that is obtained 
by restricting H to S H . 

Observation 4: (1) To obtain a perfect phylogeny (T.f) for H , one can obtain a perfect 
phylogeny (7’/’) for H H and then set f(s)=f(s H ) for every se S that is haplotype- 
equivalent to s H . (2) (T/’) is a unique perfect phylogeny for H H since S H contains no 
haplotype-equivalent SNPs. 

Observation 4 implies that haplotype-equivalent SNPs are redundant, hence may 
be merged to label a single edge in ( T,f) (See Fig 4c); and by doing so, we discard the 
degrees of freedom that are due to haplotype-equivalent SNPs. 
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However, identifying haplotype-equivalent SNPs is not trivial when we only have 
xor-genotype information, which as Observation 3 implies may not suffice. In other 
words, the closest we can get to merging haplotype-equivalent SNPs is merging the 
xor-equivalent SNPs, which by Observation 3 may lead to information loss (See 
Fig 4d). 

Definition: Denote by S x c:S the set that is obtained by taking one representative from 
each xor-equivalence class. Denote by X x the xor-genotypes that are obtained by 
restricting X to S x . X x is called the canonic version of X. 

We show next that when the canonic version of X has a unique realization, then 
there was no information loss in merging xor-equivalent SNPs, since xor-equivalence 
implies haplotype-equivalence in this particular case. 

Theorem 2: Let (TJ') be a unique realization for X x . Extent the mapping/’ to S by 
setting f(s)=f(s x ) for every s that is xor-equivalent to s x . Then ( TJ) is a perfect phy- 
logeny for the correct haplotype matrix that gave rise to X. 

Proof: By Proposition 2, (TJ’) is a unique perfect phylogeny for X x , and by Proposi- 
tion 1 it is a perfect phylogeny for the correct haplotype matrix on S x . We will next 
show that in the special case where (TJ’) is unique, xor-equivalence implies haplo- 
type-equivalence for the data set X. Then, by Observation 4, ( TJ) is a perfect phylog- 
eny for the correct haplotype matrix that gave rise to X. Suppose to the contrary that 
SNPs ,y , ,,s 2 e S are xor-equivalent but not haplotype equivalent. Consider the unique 
perfect phylogeny (I 5 / 5 ) of H H . Since and s 2 are not haplotype-equivalent they 
label distinct edges, e j and e 2 respectively, in 7 s . Notice thatf~ l (e l )'Uf~ 1 (e 2 ) are xor- 
equivalent. Let (T s l ,f l ) be obtained from (T 5 / 5 ) by contracting e x (identifying e/s 
nodes), and by taking f x (s)=e 2 for sef A (e l ). ( T S 1 J S ,) is similarly obtained from (7/ f) 
by contracting e 2 Then both (T s l J s l ) and (T S 2 J S 2 ) realize X x , and (T s l J s l )^(T s 2 J s 2 )-, in 
contradiction to the uniqueness of (7/’). □ 

The formulation of Problem 1" leads to a connection between XPPH and the graph 
realization problem: 

Problem 2: The Graph Realization Problem (GR) 

Input: A collection P= { // } of subsets, P l ,...,P n <zS. 

Goal: Find a pair (TJ) that realizes P. 

Observation 2: Problem 1" is now exactly the graph realization problem (when re- 
stricting the solution to GR to be unique). 

The graph realization problem was first defined in matroid theory by Tutte [16], 
who proposed an algorithm of 0(mn 2 ) time, where \P\=m and |S|=n. Gavril and Ta- 
mari [17] subsequently solved it in time 0(m 2 ri). Later, Bixby and Wagner [15] pre- 
sented an 0(a(m,n)mn) time algorithm, (a(m,n) is the inverse Ackermann function, 
u(m,n)<4 for all practical values of m,n). All three algorithms required linear space. 
These algorithms determine the existence of a graph realization and also the unique- 
ness of such a solution, hence they can be applied to solve XPPH. 
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The above discussion implies that the following procedure solves XPPH: Let M be 
the incidence matrix of X and S, i.e., M„= 1 iff .v,g X ; . Find S x and X x . (This can be 
done by a radix-sort of the columns of M in 0(nm) bitwise operations.) Then solve the 
graph realization problem on X x . If the solution is unique it implies a perfect phylog- 
eny for X. 

In case that the xor-genotypes data cannot be augmented and there are several solu- 
tions to the GR problem, we may wish to choose one of them as a perfect phylogeny 
for X. Additional considerations may help in the choice [9], We have developed a 
method for efficiently representing all the solutions for the graph realization problem 
by extending the algorithm in [17], This representation is intuitive and implementa- 
tion is straightforward. Details are omitted in this abstract. 

2.2.2 Assigning Actual Haplotypes 

In the previous section we concluded that even when XPPH has a single solution, the 
assignment of haplotypes to the tree nodes can be done only up to bit flipping. In 
order to obtain a concrete assignment, the input data must be augmented by additional 
genotyping of a selected set of individuals. We will prove that it suffices to fully 
genotype at most three individuals, and show how to select them. First, we explain 
how the additional genotype data are used to resolve the haplotypes. Denote by G ; the 
genotype of individual i (whose xor-genotype is X : ). Hereafter, we consider only those 
individuals with X^0. 

Problem 3: Haplotyping on the Tree 

Input: (a) A collection of non-empty xor-genotypes X\ (b) a perfect phylogeny (Tf) 
for X, which is unique up to haplotype-equivalent SNPs; and (c) complete genotypes 
of the individuals {i l ,...,i p }. 

Goal: Infer the haplotypes of all the individuals. 

Haplotyping across the tree is based on the above key property, which determines 
the alleles of a SNP j for all haplotypes, based on its allele in some particular node. 
More specifically, all those alleles are determined given a genotype G ( , homozygote 
for SNP j, whose haplotypes correspond to identifiable nodes in T. Consequently, G ( 
resolves the bit-flip degree of freedom for each SNP .vg S\X j . Hence: 

Proposition 3: The haplotypes can be completely inferred by G l ,...,G p iff 
X l r\...nX=0. 

The proposition brings about a criterion by which individuals should be selected for 
full genotyping. It motivates the following set-selection problem: 

Problem 4: Minimum Tree Intersection (MTI) 

Input: A collection of sets X={X l ,...,X ll ] and a perfect phylogeny (T.f) for X. 

Goal: Find a minimum subset of X whose intersection is empty. 

Note that the prefect phylogeny condition here is crucial: Without the condition that 
each X, is a path in the tree, the problem is equivalent to the NP-hard set-cover prob- 
lem. 
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Theorem 3: If X 1 n...nX ) =0 then there is a minimum tree intersection set of size at 
most 3. 

Proof: Consider the path X v and w.l.o.g. label the SNPs according to their order 
along that path as (1, ...,&). For each i, the set X l nX- defines an interval in that order. 
If X l nX j =0 for some i then \X l ,X j ) are a solution. Otherwise all intervals overlap X v 
Denote these intervals by [/,r.] for j=2,...,n. Take the interval that ends first and the 
interval that begins last, i.e., L=argmin / .(r / .) and R=axgmaXj(lj). Since X 1 n...r\X n =0 
then [/ 2 ,r 2 ]n...n[/ n ,rj=0, hence it follows that [/ R ,r R ]n[/ £ ,rJ=0. We get 
(X 1 nX L nX R )=0. □ 

In case no SNP is present in all X.-s, the above proof provides an algorithm for 
finding three individuals whose full genotypes solve MTI. A slight modification al- 
lows finding two individuals instead of three when possible. The time complexity is 
0(nm). Let F=X 1 n...nX n . 

Corollary 1: There are at most three individuals whose genotypes can resolve all the 
haplotypes on the SNP set S\F, and they can be found in 0{nm) time. 

In case F£0, the SNPs in Y can be inferred up to bit flipping. 

2.3 Experimental Results 

We implemented Gavril and Tamari’s algorithm for Graph Realization [17]. Although 
it is not the asymptotically fastest algorithm available, it is simpler to implement and 
modify than [15]. Moreover, as the block size is usually bounded in practice by n?<30, 
the quadratic dependence of the algorithm on m is not a handicap. Our implementa- 
tion, GREAL, was written in C++, and is available at http://www.cs.tau. ac.il/ 
-rshamir/greal. Another implementation due to Chung and Gusfield has recently been 
announced [19]. 

We used a standard population genetics simulator due to Hudson [22] to generate 
data samples under the perfect phytogeny model. In each run we generated c=2400 
chromosomes with a prescribed number of SNPs, preserving the default values for all 
other simulation parameters. An important parameter in the experiments was the mi- 
nor allele frequency cutoff, denoted by a: For a given value of a, we only used SNPs 
whose less frequent allele occurred in >ac chromosomes. The resulting haplotypes 
were randomly paired to generate xor-genotypes of individuals. 

How many individuals are required to get a single solution? 

We evaluated this measure by randomly adding individuals one by one and reapply- 
ing GR till the solution is unambiguous. The results (Fig. 5) show that for a>0.03, the 
number of individuals required to obtain a single solution is roughly an a-dependent 
constant, irrespective of the number of SNPs, and is practically bounded by 70. When 
rare alleles (a=0.01) are present, the behavior is less predictable and the variance is 
very large. However, comprehensive sampling of the haplotypes is usually not 
achieved when rare alleles are present; fortunately, performance is satisfactory above 
the accepted a cutoff of 0.05. 
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Fig. 5. Conditions for uniqueness of the solution. The plots show the number of xor-genotypes 
(y-axis) needed for obtaining a single solution for a given number of SNPs (x-axis). Different 
lines (or least squares curves) correspond to different thresholds on the minor allele frequency 
cutoff a. Note that the interpolated curve for a=0.01 is an extremely rough estimate. 



XPPH vs. PPH 

Since xor-genotypes contain less information, they may have a potential economic 
advantage over full genotypes. However, the number of individuals required for ob- 
taining the haplotypes is larger. We compared the number of individuals needed by 
XPPH and by PPH. Chung and Gusfield [23] evaluated experimentally the number of 
individuals required for obtaining a unique solution to PPH. We computed the same 
statistic for XPPH (Fig. 6a). For 50 SNPs, 50 xor-genotypes guarantee -90% chance 
of uniqueness, and increasing the number of individuals has only a minor effect. Es- 
sentially the same results hold for 100 SNPs. In comparison to [23], the chances for a 
unique XPPH solution with > 50 xor-genotypes is only a few percent lower than for 
PPH data with the same number of full genotypes. 







number of individuals 



b) 




Fig. 6. The chance for a unique solution, (a) The frequency of a unique solution (y-axis) versus 
the number of individuals tested (x-axis). XPPH statistics are based on 5000 runs for 50 or 100 
SNPs after filtering with a=0.05. PPH statistics from [23] are plotted for comparison, (b) The 
distribution of the number of non-unique solutions in deep coverage studies. Statistics were 
collected for 35 configurations of the number of SNPs (100-2000) and the number of individu- 
als, which was at least 10 times the number of SNP equivalence classes. (a=0.05). 
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How high is the multiplicity of non-unique solutions? 

We further focused on outlier ambiguous datasets, i.e., those that are ambiguous de- 
spite the examination of many individuals. For such datasets, the number of possible 
solutions is of much practical interest: If this number is limited, each solution may be 
tested separately. Indeed, the results (Fig. 6b) show that in this situation, when the 
solution is not unique, there are only a handful of solutions, usually only 2. Note that 
we assume equivalence of = H and ~ x for outlier datasets, which we confirmed for the 
datasets used here. 



3 Informative SNPs 

3.1 Problem Definition 

In this section we study informative SNPs under the perfect phylogeny model. We 
begin by introducing some terminology, concordant with [13]. 

Definition: Let H={H 1 ,...,H n } be a set of haplotypes over a SNP set S= [ .v , , . . . ,s m } . 
Let S"(zS be a given subset of interesting SNPs. The set 5”&SV>" is informative on H 
w.r.t. S" if for each 1 <k,l<n, whenever there is a SNP s"eS" for which H ks ,^H, ,,, there 
is a SNP s 'e S ’ for which H ks ^H [s ,. 

Note that that assumption that the informative and interesting SNPs are disjoint is 
made without loss of generality, since we can duplicate interesting SNPs as part of the 
candidates for the informative set. We generalize the Minimum Informative SNPs 
problem [13] by introducing a cost function, as follows: 

Problem 5: Minimum-Cost Informative SNPs (MCIS): 

Input: (a) A set of haplotypes H={H 1 ,...,H n } over a SNP set S={s l ,...,s m ] along with 
a perfect phylogeny (T,f) for H. 

(b) A set of interesting SNPs S'tS. 

(c) A cost function C:S— >R + . 

Goal: Find a set S”qSV>" of minimum total cost that is informative w.r.t. S". 

( Tf) may already be known if H was found by solving XPPH. Alternatively, it can 
be computed in O(mn) time from haplotypes [21]. 

A common task which is related to picking an informative SNP set is to describe 
all of the haplotype variation in the region [20]. Formally, we seek a tag set S’czS s.t. 
for each l<l,k<n, there is te S’ for which H k j£H, f In order to find tag SNPs of mini- 
mum cost, one could duplicate the SNP set S and define one of the copies as interest- 
ing. A solution to MCIS on the duplicated instance is a tag SNP set of minimal cost. 
Hence we shall focus on the more general MCIS problem. 



3.2 An Algorithm for MCIS 
3.2.1 Problem Decomposition 

Recall that if T=(V,E) is a perfect phylogeny for H nxm then { 1 n]cV, i.e., the hap- 

lotypes of H label nodes in the perfect phylogeny. If a node of T is labeled by a haplo- 
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type from 77 we say it is observed. Otherwise we say it is ancestral. Ancestral nodes 
represent haplotypes that have been intermediate stages in evolution but did not sur- 
vive to the present, or were not collected in the sample. It is easy to see that the leaves 
of T are always observed. The observed internal nodes in T can be used for a decom- 
position of T as follows: 

Definition: An ancestral component is a subtree of T in which all the internal nodes 
are ancestral and all the leaves are observed. 

Since the leaves of T are observed, T can be represented as a union of edge-disjoint 
ancestral components, where each union step merges two components by identifying 
copies of the same observed node. Two different components can share at most one 
observed node, but do not share ancestral node. Partitioning T into ancestral compo- 
nents is straightforward. We now show that in order to find informative SNPs we can 
divide the tree into ancestral components and find informative SNPs for each single 
component separately. The subproblem on a component is defined as follows: Denote 
an instance of MCIS by the input tuple I=(H,S,C,T,fS”). Let T v ...,T p be 7”s ancestral 

components where 7j=( V,E.). Denote by SfzS the SNPs that label E,. The input tuple 
for T j is /.=(//. ,5, C-, 7)/., 5. ”) where the sets and functions are the restriction of the 
original sets and functions to S p 

Theorem 4: Suppose for every i, IS(Il) solves /. Then IS( I)=IS(I\ )u. . .uAS'l/j) 
solves I. 

Proof: We shall show that 75(7) is informative w.r.t. S" iff IS(I t ) is informative w.r.t. 
S" for all i; The theorem then will follow by the additivity of the cost function. If 
haplotypes kj belong to the same observed component T-, and there is a SNP s such 
that H ks ^H ls , then by the key property it must be that se 5-. Therefore, the informa- 
tiveness of 75(7) implies the informativeness of 75(7) for all i. For the opposite direc- 
tion, suppose there are teS" and 1 <l.k<n such that H k j£H. . Let 7j be the subtree which 
contains the edge with label t (i.e., fe 5). Then by the key property, there are l’,k' in 7j. 
such that H k .(£H Vt , where I’.k’ are the observed nodes of 7j. that are on the path from k 
to / in T. But then there is ,s ’e /5(/ ( )(z/5(/) such that H k , s ^H lv . Hence, by the key 
property, H ks *H k „ □ 

3.2.2 Solving MCIS on an Ancestral Component 

In this section we solve MCIS restricted to a single ancestral component. We first 
reformulate it in terms of the tree edges, and then show how to solve it. We introduce 
the following notations: Edges labeled by interesting SNPs are called target edges. 
The set of target edges is T={e\ f ~ l {e)r\S"^0] . It specifies the interesting information 
in terms of tree edges. An edge is allowed if it is labeled by some non-interesting 
SNP. The set of allowed edges is ot={e\f 1 (e)rs(S\S")^0} . These are the edge-analogs 
of potentially informative SNPs. Edges in Act are called forbidden. Forbidden edges 
cannot be used as informative, but edges in in a can. 
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We now expand the definition of the cost function to edges: The cost of an edge e, 
denoted Cie), is the minimum cost of a non-interesting SNP that labels e. For ee t\a 
define C(e)=°°. This allows us to provide an equivalent formulation for MCIS: 

Problem 6: Minimum Cost Separating Set (MCSS) 

Input: The same input as for MCIS. 

Goal: Find E 'cE of minimum cost, such that in G=(V,E\E ’) there are no two observed 
nodes that are connected by a path containing a target edge. 

Proposition 4: MCIS and MCSS are equivalent. 

Proof: It suffices to show that an informative set for H w.r.t. S" separates those ob- 
served nodes that are connected by a path containing edges from x, and vice versa. 
Observed nodes of T, v, and v 2 , have corresponding haplotypes of IE H k and H,, and 

vice versa. But then by the key property H k j£H ls iff s labels an edge on the path from 
Vj to v 2 . □ 

We are now ready to outline a dynamic programming algorithm for MCSS. 
W.l.o.g. assume |V|>2. Take some internal node re V and root T at r. For ve V denote 
by T v =(V v ,E v ) the subtree of T that is rooted at v. For a solution S v cE v of the induced 
sub instance I(T ), denote by R the connected component which contains v in 
G v =(V ,E\S ). The algorithm will scan T from the leaves up and at each node v form 
an optimal solution for the subtree T based on the optimal solutions for the subtrees 
of its children. When combining such children solutions, we have to take into consid- 
eration the possibility that the combination will generate new paths between observed 
haplotypes, with or without target edges on them. To do this, we distinguish three 
types of solutions: S is called empty if there are no observed haplotypes in R . It is 
called connected if some observed haplotypes in R are connected to v via target 
edges. S is called disconnected otherwise, i.e., if there are observed haplotypes in R v 
but there is no path connecting an observed haplotype to v via target edges. Let N , P v 
and A denote the respective best empty, connected, or disconnected solutions. We 
define recursive formulae for their costs as follows: 

• For a leaf node ve V we initialize: C(N V )=°°, C(P V )= °°, C(A v )=0. 

• For an internal node ve V with children { m |5 . . ,,m W v) } we write: 

Tear (i) = min{c (/V ), C (P )+ C (v, u t ), C )+ C (v, u t )} 

k(v) 

C(N v )^Tear(i) 



(1) 

( 2 ) 



C (P v ) = min 





')+C(Nj-Tear(j)J, min { C ( A »,) +C ( JV v)- Tear 0)} 

J\v ,u Jr r 



( 3 ) 
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If {'| (v,w,.)g r}= (p then C(/l,)= °° 



Otherwise C (A v ) = c(A h )+ ^ miir{c(A H ) 7ear(/)}+ ^Tear(i) ^ 

( v , m , j (v ,Uj ) e T 

where j = argmin ( c(A m )-Tear(i )) 

i (v.m, k r 



The auxiliary value Tear(i) measures the cost of an empty solution for the subtree 
including the edge (v,n-) and the subtree of u r In computing C(P V ) we have to either 
pick the cheapest of two alternatives: (a) all the subtrees are empty except one which 
is connected (first term in (3)), (b) all the subtrees are empty except one that is dis- 
connected but incident on v via a target edge (second term). In computing C(A v ) we 
find the best disconnected subtree, and allow the remaining subtrees to be either dis- 
connected or empty. These formulae are implemented in a dynamic program as fol- 
lows: (1) Visit V in postorder, computing C(N V ), C(P ,,) and C(A V ) for each ve V. Ob- 
tain the minimal cost by min ( C(N r ),C( P r ),C(A r )} . (2) Compute N v , P v and A y by 



following the traceback pointers to get all those (v,n.) edges that were chosen by the 
minimal cost while taking c ^ p ) + c( v „ ) or C ^ A ) + c(v u )• The time complexity of 

this algorithm is 0(|*S'|). 



3.3 Tag SNPs from Genotypes 

Up until now we have followed the standard assumption in the computational litera- 
ture [13,24,25] that tag SNPs need to reconstruct the full binary haplotypes from bi- 
nary haplotypes of the tag set. As experiments that provide haplotypes are expensive, 
most studies seek to obtain experimentally only genotypes. For such data, the problem 
of finding tag SNPs should be reformulated to reflect the fact that the input is geno- 
types, rather than haplotypes: Recall that standard genotyping has three possible calls 
per site: [0,0], {1,1} and {0,1}, where the first two are homozygous and the latter is 
heterozygote, (The calls are often abbreviated to 0,1, and 2 respectively, and the geno- 
type is represented as a vector over {0,1,2}.) The following question arises: Find a 
subset of SNPs given whose genotype calls one can completely identify the pair of 
haplotypes of an individual. We call such subset phasing tag SNPs. 

Formally, let H be a set of haplotypes over a set S of SNPs, and consider genotypes 
formed from haplotype pairs in H. Denote by g(k,l) s the genotype formed from H k and 
H t on the SNP set S. We say that [i v i 2 \ and are distinct with respect to S if 

there is se S such that g(i l ,i 2 ) s ^g(j l J 2 ) s . 

Definition: S’qS is a set of phasing tag SNPs if every two haplotype pairs from PI are 
distinct with respect to S'. Hence, from the genotype calls of an individual for the set 
S’, one can uniquely determine the exact sequence of the complete set S for each of its 
two haplotypes. 
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In general, the definitions of phasing tag SNPs and tag SNPs differ (see Fig. 2). 
The former is stronger: 

Observation 5: If S t.S’ are phasing tag SNPs then they are also tag SNPs. 

Proof: All homozygous genotype-call vectors are distinct w.r.t. S’: for all &j, 
g(i,i) s ^g(jj) s ,. □ 

We now show that, surprisingly, under the perfect phytogeny model, tag SNPs and 
phasing tag SNPs are equivalent. This identifies the commonly used definition with 
the more theoretically sound one, and therefore justifies the application of the current 
body of theory on tag SNPs to genotype data. 

Theorem 5: Suppose that the haplotypes in H satisfy the perfect phytogeny model on 
S. A set S t.S' is a tag SNPs set if and only if S’ is a phasing tag SNPs set. 

Proof: It suffices to prove the “only if’ direction. Suppose to the contrary that S' are 
tag SNPs but not phasing tag SNPs. Let G' ( = (//,.//,} and Gj={H 3 ,H 4 ) be distinct hap- 
lotype pairs with the same genotype call vector for S’, i.e., g(l,2) s ,=g(3,4) s> . Since S’ 
is a tag SNP set, it distinguishes //, and H 3 , so there must be s 3 eS’ such that G, and Gj 
are heterozygous to s v and H ] and H 3 have different alleles for s,. Similarly there 
must be s 2 e S ’ such that Gj and Gj are heterozygous to s 2 , and H x and H 4 have differ- 
ent alleles for s 2 . Therefore Gj and Gj are oppositely phased on ,v , and s 2 . Since H v H 2 , 
H 3 , and H 4 are distinct, they violate the 4 gamete rule on s 3 ,s 2 , in contradiction to 
Theorem 1. □ 

4 Discussion 

We studied here several questions arising in haplotype inference under the perfect 
phytogeny model. We introduced the model of xor-genotypes, and showed results that 
lay the computational foundation for the use of such data: (i) Inference of the sample 
haplotypes (up to negation) by adapting graph realization algorithms, (ii) Only two or 
three additional full genotypes are needed to completely resolve the haplotypes. 

Simulations with genetic data show that xor genotypes are nearly as informative as 
full genotypes. Hence, genotyping methods that distinguish only between heterozy- 
gotes and homozygotes could potentially be applied to large scale genetic studies. 
Xor-genotypes may have economical advantage over the complete genotypes com- 
mon today, since the information in a xor-genotype is only a fraction of the informa- 
tion given by a complete genotype. The feasibility and economic benefit of xor- 
genotype data cannot be appreciated by currently available technologies, but this 
work lays the foundation for evaluating the cost-effectiveness of technologies for 
obtaining such data. 

The second part of the manuscript studied choosing a subset of the SNPs that fully 
describes the sample haplotypes. We provided efficient solutions to several optimiza- 
tion problems arising in this topic: We generalized previous results by finding optimal 
informative SNP set for any interesting set, and more generally, showed how to han- 
dle differential costs of SNPs. Finally, we have shown how to find tag SNPs for geno- 
type data, which generalize the definition of tag SNPs to a more practical aspect. 
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Abstract. We consider the problem of sorting linear and circular per- 
mutations and 0/1 sequences by reversals in a length-sensitive cost model. 
We extend the results on sorting by length-weighted reversals in two di- 
rections: we consider the signed case for linear sequences and also the 
signed and unsigned cases for circular sequences. We give lower and upper 
bounds as well as guaranteed approximation ratios for these three cases. 
The main result in this paper is an optimal polynomial-time algorithm 
for sorting circular 0/1 sequences when the cost function is additive. 



1 Introduction 

Sorting by reversal (SBR) is of key importance in the study of genome rearrange- 
ment. To date, most of the algorithmic work on this problem applies to linear 
sequences (permutations or strings), and assumes the naive unit-cost model. 
This model corresponds to a simple evolutionary model in which inversion mu- 
tations of any length are considered equally likely [1]; however, the mechanics of 
genome-rearrangement events (both inversions and other less ubiquitous trans- 
formations) suggest that the probabilities of reversals are dependent on fragment 
length. 

Recently, a length-sensitive model was introduced [2,3], in which the cost of 
a reversal is a function of the length of the reversed sequence and the total cost 
is the sum of the costs of the individual reversals. Several nontrivial lower and 
upper bounds have been derived for the problem of SBR under this cost model. 
However, these results pertain only to the unsigned version of the problem, 
where the direction of the individual elements does not matter (as opposed to 
the signed version); in the study of genome rearrangement, the direction of the 
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elements is often important, since each element represents a whole gene (or a 
portion thereof) whose direction has significance. Furthermore, many interesting 
genomes are circular ; these include bacterial, mitochondrial, and chloroplast 
genomes, which all play an important role in studies of comparative genomics. 

In this paper we extend the recent work for the length-sensitive model on 
linear sequences in which the cost of reversing a subsequence of length £ is 
f(£) = £ a (where a > 0). We consider both permutations as well as 0/1 sequences 
representing the extreme cases of no repetitions at all among the elements on one 
hand, and as many repetitions as makes sense on the other hand. The elements 
are arranged in both linear and circular fashion, and we consider both the signed 
and unsigned cases. 

Note that in general circularity offers more opportunities for lowering the 
optimal cost to sort a given sequence by reversals, and at the same time circu- 
larity presents challenges to providing efficient solutions. A non-unit cost model 
exacerbates the problems even farther. A sequence that exemplifies this distinc- 
tion is 10”/ 2-1 l"/ 2-1 0. One can sort the sequence into 0”/ 2 l"/ 2 with 2 reversals. 
Under a length-sensitive model with a = 1 (i.e. , the reversal cost is identical to 
the length of the sequence) the overall cost is 2n — 2. In the circular case, where 
we adjoin the two ends of the sequence to form a circular arrangement, 1 reversal 
operation is enough to sort the sequence; its cost in the length sensitive model 
is 2. Thus, whereas the ratio between the costs of the two optimal solutions in 
the unit-cost model is 2, in the length-sensitive model it is <9(n). Consequently, 
the treatment of circular sequences requires more care than the linear case. 

Notice that the following relationships between the costs of the four sorting 
cases hold: 



unsigned circular < unsigned linear < signed linear . (1) 

unsigned circular < signed circular < signed linear . (2) 

A summary of our results appears in Tables 1 and 2; note that the lower 
and upper bounds for the linear, unsigned case were extended verbatim from [3], 
whereas the approximation ratios vary somewhat among the different cases. 

Table 1 . Lower and upper bounds for SBR of signed or unsigned and linear or circular 
0/1 sequences and permutations. 



a Value 


Lower Bounds 


Upper Bo 
Permutations 


unds 

0/1’s 


0 < a < 1 


12(n) 


0(n lgn) 


6>(n) 


a = 1 


!?(n lg n) 


0{n\g z n) 


<9(nlgn) 


1 < a < 2 


U(n“) 


0(n a ) 


0(n a ) 


a > 2 


77{r?) 


0(n ‘) 


©K) 



Considerable research has been reported on the algorithmics of SBR (mostly 
signed) for linear sequences in the unit-cost model, see e.g. [4,5]. Recently, sev- 
eral papers have addressed the issues of circularity and duplication, most notably 
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Table 2. Approximation ratios for SBR of signed linear as well as signed and unsigned 
circular 0/1 sequences and permutations. 



a Value 


Signed Lin< 
Permutations 


3ar 

0/1’s 


Unsigned Cir 
Permutations 


cular 

0/1’s 


Signed Circi 
Permutations 


alar 

0/1’s 


0 < a < 1 




O(l) 




0(1) 




0(1) 


a = 1 


O(lgn) 


3 


O(lgn) 


1 


0(lg n) 


3 


1 < a < 2 


O(lgn) 


0(1) 










a > 2 


0(1) 


0(1) 


0(1) 


0(1) 


0(1) 


0(1) 



Chen and Skiena [6] who deal with fixed-length reversals to sort both linear and 
circular permutations, Hartman [7] who deals with sorting by transpositions 
(which are much less common than reversals) on a circle, and Christie and Irv- 
ing [8] who look at sorting (essentially) 0/1 linear sequences by both reversals 
and transpositions. None of these results, however, pertain to the length-sensitive 
model. 

The rest of this paper is organized as follows. In Sect. 2 we provide algorithms 
for sorting 0/1 sequences, in Sect. 3 we deal with sorting circular permutations, 
and in Sect. 4 we show how to handle signed permutations in our length-sensitive 
model. Finally, in Sect. 5 we give sorting bounds for permutations and 0/1 
sequences when inputs are signed or unsigned and circular or linear. 

2 Exact and Approximation Algorithms 
for Sorting 0/1 Sequences 

In this section we give exact and approximation algorithms for sorting linear 
signed as well as circular singed and unsigned 0/1 sequences. We first introduce 
approximation algorithms when 0 < a < 1. We then give an exact algorithm 
to sort unsigned circular 0/1 sequences when a = 1. Finally, we introduce a 
reduction showing that the signed case can be approximated using the unsigned 
case when a > 1. 

2.1 Approximation Algorithms for 0 < a < 1 

We now give O (1) approximation algorithms for sorting linear signed as well as 
circular signed and unsigned sequences. The results are proved using potential 
function arguments. 

Sorting Unsigned Circular Sequences. Given a circular sequence S, denote 
the length of the 0 and 1 blocks contained in S by z \ , . . . , Zk and wi , . . . , Wk 
respectively. Let Z = maxi<j<fc{zj} and W = maxi<j<fc{wi}- We define the 
potential function P(S) as follows: 

fc 

P{S) = Y J {z?+<)-Z a -W a . 

i = 1 




Sorting by Length- Weighted Reversals: Dealing with Signs and Circularity 



35 



Lemma 1. A reversal p of length r acting on a circular sequence S increases the 
value of the potential function P(S) by at most 4 r“, that is, P(S-p)—P(S) < 4 r a . 



Proof. The proof is by case analysis: since a reversal can cut two blocks at most 
(the blocks at either edge of the reversal), a reversal increases the value of the 
term )>2 l ',_i( z f + tc“) by at most 2 r". In addition, by similar reasoning, the value 
of either Z a or W a can decrease by at most r“. □ 

Denote S' = S-p. Notice that S = S' -p^ 1 . Therefore, the reversal p decreases 
the potential value by the same amount that p _1 increases it. Thus, by applying 
Lemma 1 on the inverse reversal, we get a lower bound on the decrease of the 
potential value. 

Corollary 1 . A reversal p of length r acting on a circular sequence S decreases 
the value of the potential function P(S) by at most 4 r a , that is, P(S ■ p) — P(S) > 
—4 r“. 

Because the cost of a reversal of length r is r a , and the value of P(-) equals 
0 for a sorted sequence, we obtain the following lower bound. 

Lemma 2. The function V(S) = \P(S) is a lower bound for sorting an un- 
signed circular sequence S. 

Proof. The proof is by induction on the number of reversals in an optimal solu- 
tion. 

Let in denote the number of reversals in an optimal sorting series. We prove 
that if a sorting solution uses exactly m reversals, then its cost is at least V(S). 

Base case: when m = 0, the sequence is already sorted. Induction step: 
suppose for all m < k the claim holds. Consider a 0/1 sequence S having an 
optimal sorting series of length m = k + 1. Denote the first reversal by p and 
let r be its length. The reversal p changes the sequence S to S', which can be 
optimally sorted by k reversals. Applying the induction assumption on S', we get 
that V (S') is a lower bound for sorting S'. By Corollary 1, P(S') + 4 r“ > P(S). 
By the definition of V(-) we get: V(S') +r a > V(S). Therefore: 

opt(S) = opt(S") + r“ > V(S') +r a > V(S) . 

as needed. □ 

Lemma 2 motivates sorting a circular 0/1 sequence by fixing its two maximal 
blocks Z and W, and linearly sorting the two subsequences between them. The 
algorithm circularlmprovedDC realizes this approach. Given a circular sequence 
S and two block indices i and j, consider the subsequence of blocks between i and 
j, counter clockwise, excluding the blocks i and j. If the subsequence starts with 
a 1, rename the 0’s and l’s and define S(i,j) to be the resulting subsequence. 
Define S(j,i) similarly. In the following we use the algorithm improvedDC, in- 
troduced in [3], for sorting linear 0/1 sequences. 

Theorem 1 . The algorithm circularlmprovedDC has an approximation ratio of 

0 ( 1 ). 
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Algorithm 1 circu 1 ar I rnproved I ) C ( S) 

1: Let Zi 0 = Z and Wj 0 = W 

2: Define Si = S(io,jo) and S 2 = S(jo,io) 

3: si <— improvedDC(Si) 

4: s 2 <— improvedDC(S 2 ) 

5: Output si + S 2 



Sorting Linear and Circular Signed Sequences. Consider a signed 0/1 
sequence (linear or circular). Define a block in the sequence to be a contiguous 
segment of 0’s (or l’s) having the same sign. Notice that there are four kinds of 
blocks in a signed 0/1 sequence. 

Represent the 0/1 sequence as a sequence of blocks &i, ... , b m . Consider the 
potential function V'i(S') = ^ f° r linear sequences S and define ^(T) as 

in Lemma 2 for circular sequences T . 

Lemma 3. The potentials and V 2 (T) are lower hounds on the cost of 

sorting linear and circular signed sequences respectively. 

Given a signed sequence S, let unsign(S') represent the sequence without the 
signs. The algorithm signedlmprovedDC sorts signed linear sequences based on 
improvedDC. 



Algorithm 2 signedlmprovedDC (S) 

1: U <— unsign)^) 

2: u <— impovedDC(C) 

3: Mimic the reversals used to sort U on S. Denote the resulting sequence by S 1 
4: Reverse elements of S' with a negative sign. Let s be the cost of this step 
5: Output s + u 



To sort circular signed sequences we modify the algorithm circularlmprovedDC 
in a similar way. We refer to the modified algorithm as signedCircularlmproved 
DC. 

Theorem 2. The algorithms signedlmprovedDC and signedCircularlmproved 
DC are 0(1 ) approximation algorithms. 



2.2 Optimal Algorithm for a — 1 

In this section we give a polynomial-time algorithm for sorting circular 0/1 
sequences with additive cost functions (a = 1). The sorting algorithm is based 
on dynamic programming. We give enough properties to constrain the set of 
candidate solutions, so that the optimal solution can be found in polynomial 
time. This approach was used in [3] to sort linear 0/1 sequences. 
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We first describe the three properties, “useless”, “cutting” and “complex.” 
The first two properties can be proved using the same techniques as in [3]. The 
main contribution of this part is in proving the third property, which is the most 
critical in establishing the optimality of the algorithm. Its proof is substantially 
different from the linear case, reflecting the essential divergence caused by the 
circularity. This is explained in detail below after we review some preliminaries. 

Consider the two extreme blocks in a reversal. If the values of these blocks 
are identical, that is both equal 0 or 1, we call the reversal useless ; if one of 
these two blocks is contained in a bigger block, we call the reversal cutting. If 
the reversal affects more than two blocks, we call the reversal complex. We call 
a reversal that is not complex simple. We can show that there are no useless 
and cutting reversals in a circular optimal solution. As mentioned earlier, the 
proof follows the same lines as in [3]. Roughly speaking, the same techniques 
apply because these proofs are “local,” involving changes of reversals where the 
changes affect a single block of Os or Is (even though the reversal is longer) . This 
explanation is elaborated below. 

In contrast, the proof of the third property for linear sequences introduced 
in [3] depends heavily on the linearity and does not apply to circular sequences. In 
the proof by contradiction, one considers the last complex reversal. The changes 
that are introduced affect the two extreme blocks that the reversal contained. In 
the case of linear sequences, these blocks are far apart and one can subsequently 
make changes independently to the reversals involving these blocks. This prop- 
erty is not true for circular sequences. Specifically, reversals that involve one 
extreme block may also affect the other, because of “wrap-around”. 

In the following, all the sequences are circular unless mentioned otherwise. 

Given a reversal series p\, . . . , p m acting on a 0/1 sequence S = Si, . . . , s q , 
where Si £ {0,1}, denote the number of reversals in which element Si participates 
by N(si). Call N(si) the reversal count of Sj. 

When a subsequence s*, . . . , Sj of S is never cut by a reversal series pi, ... , p m , 
we denote the number of reversals in which the subsequence takes part by 
N(si, . . . , Sj). 

We show that for additive cost functions no optimal reversal series contains 
useless or cutting reversals, and there exists an optimal reversal series containing 
no complex reversals. Equation (3), relating the reversal counts to the reversal 
series cost is useful for the proofs. 

m q 

\ = J2 N ^ ■ ( 3 ) 

i= 1 1=1 

The proofs of useless and cutting, appearing in [3], hold for the circular case. 
Lemma 4. A reversal series containing a useless reversal cannot be optimal. 
Lemma 5. A reversal series containing a cutting reversal cannot be optimal. 

The following definitions are needed to prove that an optimal reversal series 
containing no complex reversals exists. Given a reversal p affecting a circular 
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0/1 sequence S, notice that we can cut S, making it linear, without cutting 
the reversal. Such a linearization is done in a counter clockwise direction. This 
linearization is implicitly used throughout the section. 

We represent a 0/1 sequence l Wl , . . . , 0 W2t by the lengths of each block, that 
is we let w = Wi , . . . , wu denote a 0/1 sequence. We refer to w as a weighted 
sequence. A subsequence of w, sg = Wi,...,Wj or for short (i,j), is called a 
segment of w. Let sg k = w, k , . . . ,Wj k for k G {1,2} be two segments of w. We 
say that sgi is a sub-segment of sg 2 , or contained in sg 2 , if %2 < *i < j l < J2- 

We denote a reversal acting on a segment (i,j) of w by p(i,j). Let w = 
Wi , . . . , u> 2 e be a weighted sequence, and let p ( i,j ) be a reversal acting on it. If 
i = j + l(mod2), or for short i = j + 1, we say that p unifies wi with Wj+ 1, and 
Wj with Wi- 1. The reversal p is called a unifying reversal. In addition, we say 
that block Vi-i of v = w ■ p = Vi, . . . , V 2 C -2 contains blocks Wi-i and Wj of w. 
Similarly we say that Vj - 1 of v contains blocks Wj + 1 and Wi of w. 

A reversal series pi, , p m acting on w separates a segment sg of w, if 
pi, . . . , p m unifies all the 0 and 1 blocks of sg. If sg = w then pi, . . . , p m sorts w. 

In the sorting process the essence of a reversal is not directly connected to 
the coordinates (i,j) that define the reversal, but is designated by the blocks 
that the reversal affects. This essence plays a crucial rule when we try to define 
commutative reversals. 

Given a reversal p acting on w and a reversal p acting on w • p, we say that 
commutes with p if two reversals rf and p' exist such that w ■ p ■ r) = w ■ rf ■ p' , 
where rf and p' affect the same blocks of w that 77 and p affected respectively 
(in the following we drop the primes). 

One can verify that the following proposition holds. 

Proposition 1. Let w be a weighted sequence, p be a reversal acting on it, and 
p be a reversal acting on w ■ p. The three following conditions are equivalent: 

1. The reversal g commutes with p. 

2. The blocks that p affects in w • p are contiguous in w. 

3. The segments that p and p affect are either contained one in another or 

disjoint. 

Given a weighted sequence w and a minimum sorting series pi, ... , p rn , define 
wi = w ■ pi ■ ■ ■ pj . Let pj be the greatest index complex reversal and s the 
segment that pj affects. Denote by s the rest of the sequence. Let pk for k > j 
be the smallest index reversal such that s is separated in w k . Consider msg, the 
maximal segment that contains s and is separated by pj, . . . , pk- If msg 7^ w, the 
circularity of w is not used in the separation of msg. Therefore, we can consider 
msg as a linear 0/1 sequence, and pj as a complex reversal used through the 
separation of a linear 0/1 sequence. Under these assumptions, [3] proved that 
msg can be separated optimally without the use of complex reversals. Therefore, 
without loss of generality, we assume that msg = w. That is, s is separated only 
when w is sorted. The following lemmas characterize the reversals performed 
after pj. 




Sorting by Length- Weighted Reversals: Dealing with Signs and Circularity 



39 



Lemma 6. Let w, j, pj ,w° , s, and s be as above. Let pk for k > j be a reversal 
that does not commute with pj . Then pk affects at least one block that contains 
mixed elements from s and s. In addition, all the blocks that pk affects become 
part of a mixed block. 

Lemma 7. Let w,j, pj, w J , s, and s be as in Lemma 6. Let pk for k > j be the 
smallest index reversal that commutes with pj. Then pk commutes with all the 
reversals p q for k > q > j. 

Corollary 2. Let w,j,pj,wfs, and s be as in Lemma 1. We can rearrange the 
reversal series so that the reversals pk for k > j do not commute with pj . 

The reversals pk for k > j are simple reversals and by Corollary 2 do not 
commute with pj. 

The following lemmas characterize the sorting process under simple reversals. 
They are used for proving that the reversal series cannot be optimal if pj remains 
complex after performing the rearrangement described in Corollary 2, and for 
calculating the optimal cost in polynomial time. 

Given a weighted sequence w = Wi, ... , W 21 and a reversal series pi , . . . , p m 
containing no cutting reversals, define c, to be the number of reversals in which 
Wi takes part, i.e. Ci = N (wf). Notice that c, is well defined, since pi,...,p m 
contains no cutting reversals. 

Lemma 8. Let w = wi, . . . , W 2 t be a weighted sequence and let pi, . . . , p m be a 
sorting series having no useless, no cutting, and no complex reversals. Denote 
w k = w ■ pi ■ ■ ■ pk, and let w° = w. Then each block of w k contains a block of w 
that takes part during pi , . . . , pk in zero reversals. 

Proof. By induction on k. Base case: the claim is trivial for k = 0. Induction 
step: suppose each block of w k contains a block of w that takes part during 
pi, . . . , pk in zero reversals. We need to prove that each block of w k+1 contains 
a block of w that takes part during pi, . . . , pk+i in zero reversals. Since Pk+i is 
simple, we know that it acts on two successive blocks of w k . Since pk+i is not 
cutting, pk+i must be of the form p (i — 1, i). Such a reversal unifies the block w k 
with w k _ 2 , and w\ : _ 1 with w k +i . The other blocks of w k are not affected by pk+i, 
therefore, they contain by the induction assumption a block of w that takes part 
during pi, ..., pk in zero reversals. The same block takes part during pi,...,pk 
in zero reversals as well. 

Since Pk+i affects w k unifying it with w k _ 2 , and by the induction assumption 
w k _ 2 contains a block v of w that takes part during pi,. . . ,Pk in zero reversals, 
the unification of w k with w k _ 2 in w k+1 contains v. Since v is not affected 
by pk+ 1 , v takes part during pi, ... , pk in zero reversals, and thus fulfills the 
requirement. 

A similar analysis applies to the unification of w k _ x with w k +1 in w k+1 . □ 

Lemma 9. Let w = Wi, . . . , W 2 i be a weighted sequence and let pi, . . . , p m be 
a sorting series having no useless, no cutting, and no complex reversals. There 
exist indices i and j, where i corresponds to a block of 1 ’s and j corresponds to 
a block of 0 ’s, such that Ci = Cj = 0. 
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Lemma 9 is helpful in proving that the outcome of Corollary 2 is a reversal 
series containing no complex reversals. 

Lemma 10. Let w,j, ()j. w 3 , s. and s be the outcome of Corollary 2, that is pk 
for k > j do not commute with pj . If pj remains a complex reversal , the reversal 
series pi , . . . , p m cannot be optimal. 

Proof. We show that a reversal series having a lower cost exists. Consider the 
sequence w 3 . The reversals pk for k > j are simple. By Lemma 9 there exists 
indices i' and p\ where i' and p' correspond to a block of l’s and 0’s respectively, 
such that d' = c p > = 0. Notice that each of the blocks i' and p' could be contained 
as a unit in w 3-1 , or could correspond to two blocks, one in s and one in s. In 
the latter case, pick the part of the block that is in s. In the former pick the 
blocks themselves. Denote the indices of the picked blocks in w 3-1 by i and p. 
We divide the analysis into three cases: 



1. The indices i and p are contained in s. In this case we get c p = c* = 0, and 
therefore i and p divide the circular sequence into two linear subsequences. 
We consider the subsequence between i and p that contains s, denote it v, and 
the restriction of p\ , . . . , p m to v. This restriction separates v and contains a 
complex reversal followed by simple reversals that do not commute with it. 
As mentioned before, [3] proved that such a series cannot be optimal. 

2. The indices i and p are contained in s. This imposes that \i — p\ = 1, or 
else the reversals acting on the blocks between i and p commute with pj. 
Suppose that i appears before p in s when scanning the circular sequence in 
the counter clockwise direction. The other case is handled by renaming the 
0’s and l’s. Performing the reversals pk for k > j restricted to s and to s 
separates s, while s gets the form 0 + l + 0 + l + . The notation 0 + corresponds 
to an arbitrary strictly positive weight of 0’s. The whole sequence is of the 

s s s 

form 0 + 0+1+0+1+ 1 + . Notice that the orientation of s and s is opposed 
since pj is not performed on s. To sort the sequence, perform a last reversal 

s last reversal s 

0 + 0+ 1+0+ 1 + 1 + . The cost of the modified series is not greater than 

the cost of the original one, since the last reversal’s length is not greater 
than pj' s length. However, by Lemma 6, the restriction of Pj+i to s or to s 
produces a reversal containing a single block. If we omit this reversal from 
the modified series the cost of the modified series becomes smaller than the 
original one. A contradiction to the optimality of the original series. 

3. The index i is contained in s and p contained in s or vice versa. We anal- 
yse the former case. The latter is handled by renaming the 0’s and l’s. 
Notice that after performing pj, the indices i and p divide the circle into 
two independent components. The reversal p m can affect only one of these 
components. Consider the following division of the sequence w 3 (counter 



s , where Si and Si are the segments 



clockwise): 0 + 
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Table 3. Available reversal counts summary. 



Segment 


S2 


i 


Si 


S2 j 


Block 


1^ 


0+ 


1 + 


F" 


T+ 


0+ 


0+ 




0+ 


Available reversal count 


2 


2 


T 


~T 


~T 


1 


1 


l 


0 


Explanation: reversals skipped 


Pj , pm 


Pj , pm 


pj 


pj 


pj 


pj 


Pm 


Pm 





of s and s respectively lying between p and i, while S2 and S2 are the seg- 
ments of s and s respectively lying between i and p. Suppose that p m affects 
S2 and 6'2- Then the reversals pu for m > k > j separate si and si. Notice 
that omitting pj does not affect the separation state of these segments. On 
the other hand, performing p j. for m > k > j on s 2 and S2 does not separate 
them. Modify the reversal series as follows: Perform p for m > k > j on 
w 3-1 restricted to each of Si, si, sq., and S2 (do not perform pj and p m ). The 



resulting sequence is of the form: 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 0 + l + 0 + . 
Table 3 summarizes the yet available reversal counts for part of the segments. 
To sort the remaining blocks perform the following reversals (reversals indi- 
cated by brackets [,]): 



s 2 i s 1 S 2 Si S 2 




The two reversals sort the sequence, and do not violate the available reversal 
counts. Thus the modified reversal series has a cost not greater than the 
original one. A contradiction is established similarly to previous cases. 

Putting it all together we prove the claim. □ 

Lemmas 4, 5, and 10 establish Theorem 3. 

Theorem 3. An optimal reversal series contains no useless and no cutting re- 
versals. In addition, an optimal reversal series containing no complex reversals 
exists. 

Theorem 3 and Lemma 9 implies that finding the optimal sorting cost can 
be done by taking the minimum over the following set: for all pair of indices 
( i,j ), where i < j and i — j = 1, sort the segments sgi = ( i,j ) and s<?2 = (j, i) 
under the restriction Ci = Cj =0. The sum of the two sorting costs is a candidate 
for the optimal solution. The calculation of the sorting costs for all pairs (i,j) 
and (j, i) can be done by running zerOneSort on ww, where w is an arbitrary 
linearization of a circular 0/1 sequence. The algorithm zerOneSort, introduced 
in [3], is an exact algorithm for sorting unsigned linear sequences when a = 1. 
The algorithm circularZerOneSort implements this idea. 

Theorem 4. The algorithm circularZerOneSort sorts a circular 0/1 sequence 
optimally with a time complexity in O (n 3 ) and a space complexity in O (n 2 ) 





42 



Firas Swidan et al. 



Algorithm 3 circularZerOneSort(w) 

i: Run zerOneSort on ww. Let A be the dynamic programming matrix built by ze- 
rOneSort 
2: opt <— oo 

3: for i = 0 to |w| — 1 in steps of 1 do 

4: for j = i + 1 to |w| — 1 in steps of 2 do 

5: tmp <— A (*, j, a = Cj = 0) + A(j,i + \w\ , Cj = a = 0) 

6: if tmp < opt then 

7: opt <— tmp 

8 : end if 

9: end for 

10: end for 
11: Output opt 



2.3 Approximation Algorithms for a > 1 

We now give a reduction showing that a signed circular or linear sorting al- 
gorithm can be derived from an unsigned circular or linear sorting algorithm 
respectively, while retaining the asymptotic approximation ratio. 

Let A be an algorithm for sorting unsigned circular 0/1 sequences and T 
be a signed circular sequence. We can sort T by running A on unsign (T) and 
correct the signs of the resulting sequence elements. Lemma 11 shows that this 
method approximates the optimal sorting cost. A similar result applies to the 
singed linear case. 

Lemma 11. The sum of A(unsign(T)) and the cost of signs correction is less 
than opt(T) + 2A(unsign(T)). 

By combining the optimality of circularZerOneSort (Sect. 2.2) and Lemma 11, 
we obtain an approximation algorithm for signed circular sequences when a = 1. 
A similar result applies to signed linear sequences. 

Corollary 3. The algorithm circularZerOneSort. followed by signs correction is 
a 3- approximation algorithm for sorting signed circular sequences when a = 1. 

In [3] its shown that bubble sort is optimal for sorting unsigned 0/1 sequences 
when a > 2. By Lemma 11 we get an approximation algorithm for sorting signed 
sequences when a > 2. 

Corollary 4. Bubble sort followed by signs correction is a 3- approximation al- 
gorithm for sorting signed linear or circular sequences when a > 2. 

3 Sorting Circular Permutations 

In this section we present an approximation algorithm for sorting circular per- 
mutations when a = 1. The algorithm is based on an analysis that there exists 
a cut that changes the circular permutation into a linear permutation, such that 
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the sorting cost for the linear permutation is only a constant factor larger than 
the sorting cost for the circular permutation. The algorithm is therefore to try 
all possible cuts. The proof follows an accounting argument given in this section, 
and is heavily dependent on the additivity of the cost function. The main idea 
of the argument is that we can mimic the circular reversals that we cut with 
linear ones, such that the linear reversals go the other way around the circle, not 
crossing the cut. The accounting argument shows that the total cost increases 
by at most a constant factor. 

Let 77 be a circular permutation on n elements. We present 77 as a sequence 
of integers. Denote the n linear permutations equivalent to the circular identity 
permutation by Ij for 1 < j < n. In this section, unless mentioned otherwise, 
permutations are circular. 

Associate an index with each space between two successive elements of 77. A 
reversal affects an index i, if the reversal affects the two elements neighboring i. 
Map each index i to the number of reversals r, affecting the index. We refer to 
Ti as the index reversal count. Let i o be an index such that Vi 0 = minj{rj}. 

In the following we establish a lower bound on the sorting cost using r,; 0 . 
Denote the circular optimal sorting cost of a circular permutation 77 by cosc (77). 
The linear optimal sorting cost of a linear permutation 77 is denoted by losc(77). 

Lemma 12 . Let 77 be a circular permutation, pi,,.. , p m be an optimal reversal 
sequence, and io be as above. The following holds: 

cosc (II) >n-r io . (4) 

Proof. By definition, a reversal affects an index if and only if it affects its two 
neighbors. Let i be an index with two neighbors TCj and Ttj+i- We get that 
2 ■ r, < N ( 7 Tj) + N (7Tj+i), where TV ( 7 Tj) is the number of reversals in which 
element tt 3 takes part. Notice that the set of neighbors of all odd or all even 
indices does not contain repetition. Thus, we get: 

y TV (7 Tj) > max{ ^ 2 • n, ^ 2 • r*} . 

j i odd i even 

For optimal sorting series, one can verify that JT N ( 7 Tj) = cosc (77). Hence, 
cosc (77) > max{]T. odd 2 • ri,Jf ieve „2 • r.J. From the definition of i 0 we get: 
cosc (77) > n • r io . □ 

Lemma 12 enables us to mimic circular reversals by linear reversals, while 
keeping the sorting cost within a constant factor. Given an index i, define 77 \ 
to be the linear permutation derived by cutting 77 at index i. Define 77u n to be 
T-T-io ■ 

Lemma 13. Let 77, r», and io be as in Lemma 12. Consider the linear permu- 
tation 77i; n derived by cutting LI at index io ■ Let p be a reversal affecting io ■ The 
reversal p can be mimicked by two reversals on LIn n with a total cost less than 
2 n. 
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index y elements n—y—x elements x elements 

Proof. Represent 77 and i7n n as follows: «o N ' , 

where p affects the x and y elements. The reversal p can be mimicked on 77i; n 
by the following two reversals (the brackets [,] indicate the reversal): 

y elements n—y—x elements x elements 



x elements reversed n—y—x elements reversed y elements reversed 




The final result is equivalent to the circular permutation 77 • p and the cost 
of the two reversals is bounded by 2 n. □ 

Since there are Ti 0 reversals to mimic, the total cost of mimicking the circular 
sorting procedure by a linear one is bounded by 2 -n-r^. This yields a constant 
factor approximation. 

Theorem 5. The optimal circular sorting cost of 77 can be approximated up to 
a constant factor by the optimal linear distance of 77n n to one of the n linear 
permutations equivalent to the circular identity permutation. 

Proof. Define i o as in Lemma 13. By Lemma 13, we can mimic all reversals in an 
optimal circular solution by reversals in a linear solution, ending up with a linear 
permutation 7 ;o equivalent to the circular identity permutation. The following 
bound holds for the optimal linear distance between 77i; n and Ij 0 . 

d (77i in , Ij 0 ) < cosc (77) + 2 • n ■ r io . 

By Lemma 12 we get: d (77ii„, 7 J0 ) < 3cosc(77). □ 

Theorem 5 enables approximating the optimal circular sorting cost using al- 
gorithms for sorting linear permutations: cut the circular permutation at each 
index, approximate the distance between the cut and all Ij , and return the 
minimum result. Sorting linear permutations is done by the algorithm reorder- 
ReversalSort, introduced in [3], which is a O(logn) approximation algorithm. 
The algorithm circulaReordeReversalSort is based on this idea. 

Theorem 6. The algorithm circulaReordeReversalSort has an approximation 
ratio of O (log n) . 

4 Sorting Signed Permutations 

The algorithms for sorting unsigned permutations are modified to sort signed 
permutations, while retaining their approximation ratios. This is done using the 
Bafna-Pevzner reduction [9]. 

Theorem 7. The algorithm reorderReversalSort. can be modified to sort signed 
permutations, while retaining the algorithm’s approximation ratio. 

A similar result applies to the circular case. 




Sorting by Length- Weighted Reversals: Dealing with Signs and Circularity 



45 



Algorithm 4 circulaReordeReversalSort(Il) 
1: opt <— oo 
2: for all indices i do 
3: for all j € {1, 2, . . . , |77|} do 

4: sortingCost <— reorderReversalSort (jj 1 ■ 

5: if sortingCost < opt then 

6: opt <— sortingCost 

7: end if 

8 : end for 

9: end for 
10: Output opt 



5 Basic Sorting Bounds 

In this section we give sorting bounds for linear signed as well as circular singed 
and unsigned permutations and 0/1 sequences The results are extensions of the 
linear unsigned results introduced in [3]. Equations (1) and (2) make extending 
the upper and lower bounds straightforward. 



Upper Bounds 

By (1) and (2), the singed linear case is an upper bound to all other cases. The 
cost of changing the signs of all elements is in O(n). Thus, the unsigned linear 
cost summed to 0(n) is an upper bound. This fact combined with the upper 
bounds in [3] implies the results shown in Table 1. 



Lower Bounds 

A lower bound for the unsigned circular case is by (1) and (2) a lower bound 
for all other cases. The potential function arguments introduced in [3] can be 
readily modified to fit the unsigned circular case. Summary of the lower bound 
results, for both permutations and 0/1, appears in Table 1. 
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Abstract. Optimized spaced seeds improve sensitivity and specificity in local 
homology search [1], Recently, several authors [2-4] have shown that multiple 
seeds can have better sensitivity and specificity than single seeds. We describe 
a linear programming-based algorithm to optimize a set of seeds. Our algorithm 
offers a performance guarantee: the sensitivity of a chosen seed set is at least 70% 
of what can be achieved, in most reasonable models of homologous sequences. 
Our method achieves performance comparable to that of a greedy algorithm, but 
our work gives this area a mathematical foundation. 



1 Introduction 

Heuristic homology search programs, such as BLASTN [5], are used extensively in 
genome analysis. To achieve a fast runtime, BLASTN only builds alignments between 
regions sharing a common region of k letters, called a seed. It then verifies which seeds 
come from good alignments using a local method. The various BLAST programs have 
contributed much to scientific research since its birth, and their introductory paper is 
the most cited paper of the 1990s. 

BLASTN does not identify its seeds optimally. PatternHunter [1], introduced the 
use of spaced seeds in local alignment. This significantly increases sensitivity and com- 
putational efficiency. Since this introduction, spaced seeds have been studied by many 
researchers. Keich et al. [6] proposed a dynamic programming algorithm to compute 
the exact hit probability of a seed on a random region under simple distribution. Buhler 
et al. [2] expanded upon this and built a heuristic algorithm to find optimal seeds for 
alignments when the model of homologous regions is a Markov chain. Choi et al. [7] 
also studied how to calculate the exact hitting probability of a spaced seed and proposed 
a new algorithm for identifying the optimal spaced seed. Brejova et al. [8] considered 
conserved regions determined by hidden Markov models (HMMs), particularly for cod- 
ing regions. These regions have three-periodic structure, and variation in conservation 
level throughout the region. Later, they extended the idea of spaced seeds to vector 
seeds [4], which encapsulate more freedom in describing a single seed. 

A single spaced seed cannot achieve the sensitivity of exhaustive, quadratic runtime 
programs such as the Smith- Waterman algorithm [9]. There is a tradeoff between sen- 
sitivity and the false positive rate that was described in both the paper for PatternHunter 
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II [3] and the vector seeds work of Brejova et al. [4], These papers show that the use 
of multiple seeds (or one vector seed, which is equivalent) can have greater sensitivity 
than a single seed, even at the same false positive rate. 

Here, we show how to optimize the choice of multiple seeds for homology search. 
Optimizing a single seed is /VP-hard [3], and the exponential-runtime algorithms used 
for choosing the optimized single seed are not appropriate to the multiple-seed case 
[4, 7], Instead, we formulate the optimal multiple spaced seeds problem as a maximum 
coverage problem and round the solution to a linear program for that problem. Our 
algorithm works regardless of the probabilistic model of homologous regions, giving a 
set of seeds whose sensitivity is guaranteed to be within a factor of 1 — 1/e « .63 of 
the best possible. In some cases, we can achieve a much better factor. 

In practice, our algorithm returns seed sets whose sensitivity is comparable to the 
greedily chosen set [3,4]. The primary advantage of our algorithm is that not only are 
our seed sets good, we also have a bound on the sensitivity of the best possible set. 



2 Problem Formulation 

2.1 Alignments and Seeds 

We model ungapped pairwise nucleotide alignments, which we call homologous re- 
gions, by binary strings, where 1 represents a match, and 0 a mismatch. The length of 
a region R is \R\, and R[i] is the 7th bit of II. We model these regions probabilisti- 
cally, using a random distribution on binary strings. A random region model specifies 
the distributions of its length and its position-by-position distributions as well. 

Here, we consider the following two region models, introduced in Ma et al. [ 1 ] and 
Brejova et al. [8]: 

1. PH model [1]. In this model, each bit of the region is set to one independently with 
probability 0.7, and the length of the region is a constant 64 bits. 

2. HMM model [8]. This model represents coding regions. Here, each three bits of the 
region are produced as triplets, and the conservation level in one triplet depends 
on the previous triplet’s conservation level. This models regions of high and low 
conservation. The mean region length is approximately 90. 

With models for alignments, we turn to seeds. A spaced seed S' is a binary string. 
Its length is |S|, while its weight is its number of ones. A 1 in the seed S corresponds 
to a position with a required match and a 0 means “don’t care.’’ 

Definition 1. A hit to a seed S in a region R is an offset i such that R[i + k] > S[k]for 
all 1 < k < |S|. R is hit by the seed S if there is a hit to S in R. Given a set of seeds, 
A = {Si, S 2 , ■ ■ ■ Sfc}, R is hit by A when R is hit by at least one seed in A. 



2.2 Optimization Problems 

Given a probabilistic model P for regions and two integers W and M, the optimal 
single spaced seed problem is to find the spaced seed S* with weight W and length 
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no more than M that maximizes the probability of a hit to a region chosen from P. 
Similarly, for k > 1, the optimal k-seecls problem is to find a set A* of k seeds, each 
with weight W and length no more than M, that maximizes the hit probability. These 
problems are known to be /V/ J -hard. 

We can also choose a seed from a collection of seeds. Let A = {Si, S%, . . . , S m } 
be a collection of spaced seeds, and suppose we have a probabilistic model P for ho- 
mologous regions. The seed specific optimal k-seeds problem is to find the fc-element 
subset Q* of A that maximizes the probability of a hit to P. 

We will analyze this problem by generating many regions from the distribution of 
P, and then choosing a collection of seeds with high sensitivity to these instances of P. 
We will guarantee that our set of regions is representative by choosing enough regions. 
Given a collection of N such regions, R = {Pi, . . . , Rn}, and a set A of possible 
seeds, the seed-region specific optimal k-seeds problem is to find a fc-element subset 
O ' of A that hits the most regions in P. 

A reduction from the maximum coverage problem shows that the seed-region spe- 
cific optimal fc-seeds problem is itself AlP-hard. The following sections describe how 
to formulate the seed-region specific optimal seeds problem as a maximum fc-coverage 
problem and solve it by an LP based randomized algorithm. 

2.3 Bounds on Number of Instances 

We solve the seed-specific optimal fc-seeds problem by first generating a large set of 
random regions, and then choosing a good seed set for them. Chernoff bounds tell us 
how large a set we need so the hit probability for the regions is close to the model hit 
probability. In particular, the following is due to Chernoff bounds: 

Theorem 1. Let A be a set of seeds, P a model for random regions, and R \ , . . . , P y 
independent regions from distribution P. Assume that A has probability p of hitting 
regions from P, and that the fraction of the N seeds that are hit by A is p' . For any 
0 < 5 < 1, the following bounds onp' and p hold: 

1. Pr[p' — p > S\ < e- 2N5 \ 

2. Pr\p' -p<5\ < e~ 2N5 \ 

Hence, if we choose large enough N , we can approximate the desired probabil- 
ity, p, with the sample probability, p' . Once we choose the random regions, we cast 
the problem as a maximum fc-coverage problem. Suppose we have a collection P = 
{Pi, . . . , Rn} of regions, and a set S' = {Si, . . . , S m } of seeds. Let Hi be all regions 
that seed Si hits. Our selection problem is to find the set of fc sets Hi whose union is 
as large as possible; this is the maximum coverage problem. The obvious greedy algo- 
rithm approximates this to a 1 — 1/e fraction [10], and Feige has shown that this bound 
cannot be improved for the general maximum coverage problem [11]. 

3 Choosing Seed Sets by Linear Programming 

Our approach to optimizing the maximum coverage is through linear programming and 
randomized rounding. This approach improves upon the performance of the greedy 
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algorithm in our special case of maximum coverage. The integer program is straightfor- 
ward: binary decision variables Xj are 1 when seed Sj is selected, and yj are 1 if region 
Rj is hit. Our problem is then modeled this way: 



max ■ 



N 



N 

E 

3 = 1 



Vj 



(1) 



subject to: 



Uj < E Xu for i = !, • • • , N, (2) 

m 

J> = * (3) 

i—1 

Xi G {0, 1}, for i = 1, . . . , m, (4) 

Vj G {0,1}, fori = 1,..., TV. (5) 

Constraint 2 guarantees that a region R :J is hit only if at least one seed that hits it is 
chosen, while constraint 3 permits only k seeds to be chosen. 

3.1 A Simple LP-Rounding Bound 

No polynomial-time algorithm is known to exist for integer programs. Instead, we relax 
the integrality constraints on Xj and y :j and solve the resulting linear program. Then, 
we use a random rounding technique to pick a set of seeds. Assume the optimal lin- 
ear solution to the LP is (x*, y*). We create a probability distribution on the m seeds 
that consists of the scaled x* values, x*/k, and choose k indices, Si, . . . , Sk, from this 
probability distribution. We may choose the same index multiple times. Our rounded 
solution to the LP, (x + , y + ) has xf = 1 if we chose index i, and y+ = 1 if we chose 
one of the seeds that hits region Rj. The set S = { S, } of seeds whose indices we have 
chosen form our set of seeds. (This set may be smaller than k.) 

The performance of this algorithm is as good as the obvious greedy algorithm. 

Theorem 2. The LP-rounding algorithm generates a solution with expected approxi- 
mation ratio at least 1 — 1/ efor the seed-region specific optimal multiple seeds problem, 
and can be derandomized to ensure this performance deterministically. 

To prove this theorem, we first note that the probability that a particular seed Sj is 
not chosen is simple to estimate: 

Pr[x/" = 0] = (1 — -r~) k < e - ®**. 

rv 

Now, let us consider a single region and estimate its probability of being missed. 

Pr[y+ = 0] = Pr[ Q = 0], 

k:j£H k 
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since a region is missed if no seed that hits it was chosen. If one seed is not chosen, 
other seeds are more likely to be chosen (since one seed is chosen in each round), so 

Pr i n x t=°]^ n p r[xt = o] < n 

k:j£H k k:j£H k k:jeH k 



Moving sums to products, we see that Pr [yf = 0] < exp(— J2k-j&H k x k )• Since 
y* < x i’ th' 8 probability is at most exp(— y*), which for y between 0 and 1 

is bounded above by 1 — (1 — l/e)y*. Thus, the probability that y+ — 1 is at least 
(1 — 1/ e)y*. The expected fraction of the regions that are hit by the chosen set of seeds 
is at least a (1 — 1/e) fraction of the optimal objective function value for the linear 
program. Since the LP bound is an upper bound on the optimal integer programming 
solution, the expected approximation ratio of the algorithm is at least 1 — 1/e. This 
algorithm can also be derandomized by standard techniques [12, 13]. 



3.2 A Better Bound for the Rounding Algorithm 

In some cases, we can prove a better performance bound than 1 — 1/e. If we have 
lower and upper bounds on the number of regions hit by every possible seed, the seed 
set chosen by the randomized LP algorithm may have provable expected performance 
better than what can be proved in general. For example, in the PH model, any seed with 
weight 1 1 and length no more than 1 8 hits at least 30% of the sample regions if the 
number of samples is big enough. 

In this direction, we can prove the following theorem in the appendix: 



Theorem 3. If each seed in A hits at least ON regions and the optimal linear solution 
of the LP has objective function value l* (0 < l* < 1), then the rounding algorithm has 
expected approximation ratio at least $^ 1 ) 1 * (■*■ ~ e -fc ) + (1 — )(1 — e _1 )/or 

the seed-region specific optimal multiple seeds problem. 



Table 1 give the provable approximation ratio for the region models, for all seeds 
with length at most 18 and weight 1 1 . In the PH model, 9 « 0.30, while for the HMM 
model, 9 ss 0.73. In addition, in computing these two tables, we replace Z* with 1. 

In both cases, the approximation ratio is better than 1 — 1/e and improves as k 
grows. However, if the seed weight is larger than 1 1, then Theorem 3 cannot prove an 
approximation ratio larger than 1 — 1/e. But, if the seed weight is 10, or each region of 
the PH model is longer than 64, the theoretical approximation ratio will be better. 



Table 1. Approximation ratios for region models and seeds of length at most 18 and weight 11. 



number of seeds 


2 


4 


6 


8 


10 


12 


14 


16 


ratio for PH model 
ratio for HMM model 


H 


0.655 

0.856 


■ 


■ 


0.714 

0.890 


■ 


H 


0.725 

0.894 
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4 Experimental Results 



Here, we present experiments with our seed search algorithm. We show that our algo- 
rithm finds high quality seed sets, and we compare our algorithm and the simple greedy 
algorithm, documenting that both algorithms choose good seeds. 



4.1 Probabilistic Model Results 

We used our algorithms to find good seed sets for our two random region models, PH 
and HMM. We generated 5000 instances of each model, Rm = { H \ . . . Rn}- For any 
seed set A, let pr(A) be the fraction of these 5000 instances that were hit by the set. 

If our algorithm finds a seed set A, we can use the algorithms described by Keich et 
al. [6] or Brejova et al. [8] to compute the fraction pm(A) of regions from the model M 
hit by the seed. As an upper bound on pm, we use either 1 or a bound based on the LP 
solution. Let A* be the optimal set of seeds for the model, and let I* and l* be the IP 
and LP objective function optima, respectively. We know that pr(A*) < I* < l*, since 
the LP bound is an upper bound on the IP bound, and A* may not perform better than 
the IP solution on R. Theorem 1 shows that with probability at least 0.99, pm (A") < 
Pr(A*) + 0.0215. Thus, an upper bound on the best possible seed performance, with 
probability 0.99, is l* + 0.0215. To estimate the optimization ratio of our algorithm, we 
compare Pm{A) to l* + 0.0215. 



Table 2. Estimated approximation ratio of the seed sets generated by our algorithm with PH 
model and HMM model. 





j PH model 


i HMM model ] 


# of seeds 


W = 10, 
M < 18 


W = 11, 
M < 18 


W = 10. 
M < 18 


W = 11, 
M < 18 


2 


0.916 


0.899 


0.970 


0.962 


4 


0.934 


0.912 


0.961 


0.960 


6 


0.939 


0.919 


0.959 


0.960 


8 


0.943 


0.931 


0.953 


0.956 


10 


0.948 


0.931 


0.954 


0.953 


12 


0.950 


0.936 


0.953 


0.950 


14 


0.951 


0.937 


0.953 


0.950 


16 


0.952 


0.938 


0.951 


0.948 



Table 2 shows the estimated approximation ratio of the seed sets generated by our 
algorithm for seeds of weight 10 or 11 and length at most 18. Our seed sets are at least 
90% as good as the best possible (with probability 0.99). Solving one linear program- 
ming instance takes seven hours to terminate on a single 500MHZ CPU, which we note 
is affordable, given that one only computes these seed sets once. 
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4.2 Seed Sets for ESTs 

Here, we test the sensitivity of our seed sets against real alignments generated by com- 
paring human and mouse EST sequences. Out of a collection of 29715 mouse EST 
sequences and 4407 human EST sequences, we randomly chose three million pairs to 
test for sensitivity. We computed the optimal alignments between these sequences using 
the Smith- Waterman algorithm (with matches scoring +1, mismatches and gap exten- 
sions -1, and gap openings -5), and tested to see how many of them were hit by a seed 
set. We identified true positives as being hits to alignments scoring at least 16, while 
false positives were hits to alignments with lower score. Please refer to Li el al. [3] for 
the detailed description of this procedure. 



Table 3. sensitivity and false positive values of chosen seed sets for PH model and HMM model. 
TP: hit rate of the pairs with alignment score at least 16. FP: the ratio between the number of hit 
pairs and the number of all the pairs. 





PH model 


HMM model 




W = 


11, M < 18 


W = 


10, M < 18 


W = 


11, M < 18 


W = 


10, M < 18 


#seeds 


TP 


FP 


TP 


FP 


TP 


FP 


TP 


FP 


1 


0.341 


0.167 


0.661 


0.485 


0.363 


0.171 


0.647 


0.490 


2 


0.529 


0.310 


0.835 


0.677 


0.539 


0.317 


0.869 


0.703 


3 


0.648 


0.411 


0.907 


0.771 


0.675 


0.419 


0.932 


0.793 


4 


0.715 


0.483 


0.948 


0.825 


0.751 


0.506 


0.958 


0.836 


5 


0.763 


0.535 


0.964 


0.854 


0.801 


0.568 


0.973 


0.865 


6 


0.810 


0.595 


0.973 


0.874 


0.840 


0.619 


0.983 


0.888 


7 


0.838 


0.639 


0.986 


0.895 


0.859 


0.651 


0.989 


0.900 


8 


0.883 


0.678 


0.988 


0.903 


0.882 


0.687 


0.991 


0.909 


9 


0.888 


0.700 


0.993 


0.916 


0.902 


0.712 


0.993 


0.917 


10 


0.898 


0.718 


0.994 


0.922 


0.914 


0.734 


0.995 


0.924 


11 


0.919 


0.747 


0.995 


0.926 


0.930 


0.758 


0.996 


0.930 


12 


0.931 


0.763 


0.996 


0.931 


0.938 


0.771 


0.997 


0.934 


13 


0.937 


0.777 


0.996 


0.935 


0.941 


0.781 


0.997 


0.937 


14 


0.947 


0.792 


0.997 


0.940 


0.952 


0.799 


0.998 


0.941 


15 


0.952 


0.804 


0.998 


0.943 


0.955 


0.808 


0.998 


0.943 


16 


0.954 


0.810 


0.998 


0.944 


0.959 


0.818 


0.998 


0.946 



We used the seed sets chosen in the previous experiment to test how well they did in 
this practical test, and Table 3 shows the results. Most notably, the seed sets all quickly 
converge to high sensitivity. It is worth noting that four well-chosen seeds of weight 1 1 
will often have better sensitivity and specificity than one seed of weight 10. Figure 1 
further compares the performance of four seeds with weight 1 1 chosen to optimize the 
HMM model and a single seed with weight 10, which have similar false positive rates. 
The set of four weight 1 1 seeds is much more sensitive on the pairs with an alignment 
score between 16 and 40 than the single HMM seed with weight 10. For a pair with 
score at least 40, any seed can easily detect it. Our findings show that using seeds with 
weight 1 1 can detect more subtly similar pairs than using seeds with weight 10 at the 
same false positive level. 
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four seeds with weight 1 1 vs. one seed with weight 10 




Fig. 1. Sensitivity comparison of four seeds with weight 1 1 chosen for the HMM model and a 
single seed with weight 10. 



4.3 Curated Coding Alignments 

We further tested our algorithm on two protein coding DNA alignment data sets, one 
of human/mouse alignments, and one of human/fruit fly alignments. Each data set is 
treated separately. Please refer to Brejova et al. [8] for a detailed description of these 
data. In the fly/human set, 972 alignments are used for training and 8 10 for testing, while 
in the mouse/human set, 2171 alignments are used to train and 1660 to test. These are 
available at 

http : / / raonod . uwaterloo . ca/ supplements/ 03 seeds. We used the train- 
ing samples to search for the optimal seed sets with our LP algorithm (note that no 
probabilistic model is used here), and then tested our seed sets on the test samples. 

Table 4 shows the sensitivity of the seed sets with weight 10 and weight 11. As 
shown in these two figures, four seeds with weight 1 1 are 2% more sensitive than a 
single seed with weight 10. However, there is no difference between the sensitivity of 
k ik = 2, 3,4) seeds with weight 10 and 4 k seeds with weight 11, which is because 
the alignments themselves are fairly high-scoring. A single optimal seed with weight 
10 hits 86% of the human/fruit alignments and 91% of human/mouse alignments. 

4.4 Comparison with Greedy Algorithm 

We also implemented the greedy algorithms described in Li et al. [3] and Brejova et 
al. [4] and compared it with our LP-based randomized algorithm. Table 5 compares the 
sensitivity of the seed sets found by two algorithms on the two curated sets of coding 
alignments; the results are comparable. The advantage is that the LP based algorithm 
enables us to estimate a usually much better approximation ratio of the solutions, while 
the greedy algorithm only gives a very poor (1 — 1/e) approximation ratio. 
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Table 4. Sensitivity of the seeds on coding region alignment data 





j fruit fly vs. human 


1 mouse vs. human j 


# seeds 


00 
r— 1 

VI 

< 

o 

II 

£ 


OO 

VI 

£ 

t-H 

t-H 

II 


W = 10, M < 18 


W = 11, M < 18 


1 


0.863 


0.781 


0.919 


0.872 


2 


0.920 


0.831 


0.952 


0.910 


3 


0.944 


0.868 


0.961 


0.930 


4 


0.956 


0.885 


0.975 


0.938 


5 


0.956 


0.910 


0.980 


0.948 


6 


0.964 


0.912 


0.981 


0.948 


7 


0.974 


0.917 


0.984 


0.961 


8 


0.973 


0.920 


0.989 


0.968 


9 


0.981 


0.928 


0.989 


0.972 


10 


0.981 


0.933 


0.988 


0.970 


11 


0.983 


0.937 


0.989 


0.973 


12 


0.980 


0.940 


0.990 


0.970 


13 


0.985 


0.947 


0.990 


0.981 


14 


0.985 


0.948 


0.993 


0.977 


15 


0.989 


0.949 


0.993 


0.980 


16 


0.986 


0.957 


0.997 


0.978 



Table 5. Performance comparison of seeds generated by the greedy algorithm and the LP based 
randomized algorithm. 





fruit fly vs. human 


; mouse vs. human | 




\W = 10, M < 18 


00 

VI 

t-H 

t-H 

II 

£ 


\W = 10, M < 18 


| W = 11, M < 18| 


#seeds 


| Greedy 


LP 


Greedy 


LP 


| Greedy 


LP 


Greedy 


LP 


1 


mm 




0.781 


0.781 




0.919 


0.872 


0.872 


2 


Em 




0.831 


0.831 




0.952 


0.902 


0.910 


3 


Ml 




0.860 


0.868 


0.970 


0.961 


0.922 


0.930 


4 


mm 




0.886 


0.885 


0.977 


0.975 


0.941 


0.938 


6 


0.963 


0.964 


0.911 


0.912 


0.984 


0.981 


0.957 


0.948 


8 


0.970 


0.973 


0.925 


0.920 


0.991 


0.989 


0.964 


0.968 




0.978 


0.981 


0.938 


0.933 


0.993 


0.988 


0.970 


0.970 


12 


0.985 


0.980 


0.944 


0.940 


0.994 


0.990 


0.973 


0.970 


14 


0.989 


0.985 


0.948 


0.948 


0.995 


0.993 


0.977 


0.977 


16 


0.989 


0.986 


0.951 


0.957 


0.996 


0.997 


0.981 


0.978 



5 Discussion 

The experimental result with real data demonstrates that using more seeds with a bigger 
weight is better than using fewer seeds with a smaller weight in detecting the subtly 
similar sequence pairs. However, using more seeds with a bigger weight requires much 
more memory to index the sequence database. 

As we know, for a general maximum /.'-coverage problem, the greedy algorithm and 
the LP-based algorithm have the same approximation ratio, (1 — e -1 ). An interesting 
question is if the greedy algorithm can also guarantee a better theoretical bound for the 
optimal multiple seeds problem, given that the experimental results described in this 
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paper indicate that the greedy algorithm has comparable performance to our LP-based 
algorithm. 
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A Appendix: Proof of Theorem 3 

We now prove the theorem from above. First, we give two lemmas. 

Lemma 1. Given a\, a 2 , b\, 6 2 > 0. If ^ > (3\, ^ > /? 2 , a ^+b 2 — a > an ^ 

Pi > Pi > 0, then ^±|i > otfii + (1 - a)p 2 . 

This lemma is a simple number fact, and trivial to verify. 

Lemma 2. Given Z\, Z 2 , .... z n , b > Zi > 1, b > 1, ^"_i Zi = Then we have 

Er=i(l - e- 2 *) > f5f(! - e- h ) + (n - fff )(1 - e" 1 ). 

Proof. If there exist Zi, Zj, (1 < i < j < n) such that 1 < Zj . Zj < b. Without loss of 

generality, assume z^ < Zj. For any 0 < e < min (zi — 1, b — Zj), we have 

1 _ e-^-0 + 1 - e -(^+ £ ) < l - e~ Zi + 1 - e~ Zi 



since 

(e e — l)(e~ Zi — e~ Zj ~ e ) > 0 

This indicates that if there are two z t , Zj which are not equal to 1 or b , then we can 
always adjust their values to decrease the value of ^”=1 (1 — e ~ Zi ) while keep E"=i z% 
unchanged. Therefore, — e~ Zi ) becomes minimum if and only if at most one 

Zi is not equal to 1 or b. If all z t equal to 1 or b , then the number of Zi equal to b is at 
least The lemma holds. If there is one k (1 < k < n) such that 1 < Zk < b, then 
by simple calculation we have 



Ed- 



•\ E Tl . L, , Z 

*) > -T-^r(l - e- b ) + (n- )(1 - e- 1 ) 



i = 1 



b- 1 

T^r (e ~‘~ e ~ 1)+ 



6-1 



e 1 — e Zk 



> - e~ b ) + (n - - e- 1 ). 



6-1 



6-1 



Now, assume that each seed hits at least 9N (0 < 9 < 1) sample regions. Summing 
up the right hand side of (2), for any feasible solution of the linear program, we see that 

EE E i-.jtHi X i > 0 (EEl x i) = dkN - 

Let Ni be the number of y variables with y* less than one, and 7V 2 the number 
of y variables with y* equal to one. Let l* denote the objective function value of the 
optimal linear solution. Let Yj* be the sum of the y* that are less than one, and be 
the sum of the corresponding right hand sides of (2). These are equal, since we raise Y* 
as much as possible. Similarly, let Yf be the sum of the y* which equal one (Obviously, 
Y,j = A ? 2 ), and let the sum of the corresponding right hand sides of (2). Then we 
have the following equations: 



X* +X;> 9kN 
X; < kN 2 
Yf + Yf = Nl* 
iVi > XI 
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Based on the above equations, simple calculations show: 

^2 



X* - N* 



x* - r 2 * 



Nl* 



> 



Nl* 

OkN - Y* - Y 2 * 



Nl* 



> 



ek-r 



i* 



( 6 ) 



Now we finish the proof of Theorem 3. According to the discussion in Section 3.1, 
we have the following two inequalities: 



Pr[y+ = 1] > (1 - 1/ e)y* 
Pr[l// = 1]>1 — exp(— ^ x* k ) 
Based on Equation 7, we have 






> 1 - e 



-1 



E J:y *<i y*j 

We need to deal with the case that y* equals to 1. Based on Equation 8, we have 



Er.y^iVj 



> 



No 



Ed- 



-E, 



*) 



i:y*=i 
> 1 ( X i-No 
~ N 2 1 k - 1 



e 

(1 _ e - fc ) + (N 2 - X 2 



Xf - 7V 2 



> 



x 2 7tv 2 - 1 
k - 1 



jfe-1 

x$/n 2 -i. 



(1 — e ~ k ) + (1 — ^ _~i )( 1 ~ e ~ 1 ) 



where (10) comes from Lemma 2. 

Based on (9), (11), and Lemma 1, we have 

Ef=i £%+] E,-:;,,-.! + E j:ri <i E\y+] 
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(k-l)Nl* y ' v (fc-l)iV7* 
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(A; — 1)Z* 



(Jfc-1)/* 



where (13) comes from (6). 

This completes the proof of this theorem. 
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Abstract. Given two undirected trees T and P, the Subtree Homeo- 
morphism Problem is to find whether T has a subtree t that can be 
transformed into P by removing entire subtrees, as well as repeatedly 
removing a degree-2 node and adding the edge joining its two neigh- 
bors. In this paper we extend the Subtree Homeomorphism Problem to 
a new optimization problem by enriching the subtree-comparison with 
node-to-node similarity scores. The new problem, denoted ALSH (Ap- 
proximate Labelled Subtree Homeomorphism) is to compute the home- 
omorphic subtree of T which also maximizes the overall node-to-node 
resemblance. We describe an 0(m 2 n/ log m + mn log n) algorithm for 
solving ALSH on unordered, unrooted trees, where m and n are the 
number of vertices in P and T, respectively. We also give an 0(mn) 
algorithm for rooted ordered trees. 



1 Introduction 

The matching of labelled tree patterns, as opposed to linear sequences (or 
strings) , has many important applications in areas such as bioinformatics, semist- 
ructured databases, and linguistics. Examples include the comparison among 
metabolic pathways, the study of alternative evolutionary trees (phytogenies), 
processing queries against databases and documents represented in e.g. XML, 
and many fundamental operations in the analysis of natural (and formal) lan- 
guages. In all these scenarios, both the labels on the nodes as well as the structure 
of the underlying trees play a major role in determining the similarity between 
a pattern and the text in which it is to be found. 

There are several, increasingly complex ways to model these kinds of prob- 
lems. A starting point is the subtree isomorphism problem [15,16,23]: Given a 
pattern tree P and a text tree T, find a subtree of T which is isomorphic to P 
(i.e. find if some subtree of T, that is identical in structure to P, can be ob- 
tained by removing entire subtrees of T) or decide that there is no such tree. 

* Research supported in part by the Bar-Nir Bergreen Software Technology Center of 
Excellence. 

** Research supported in part by the Aly Kaufman Post Doctoral Fellowship and by 
the Bar-Nir Bergreen Software Technology Center of Excellence. 
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Fig. 1. Subtree homeomorphism. The 
white nodes in the text are degree-2 
nodes that have been removed. Two 
subtrees in the text that are homeo- 
morphic to the pattern are circled by 
the dashed lines. 



Pattern 



Text 




Fig. 2. Approximate Labelled Subtree Home- 
omorphism. The node-label similarity scores 
are specified in table A, and D = —1. Two 
subtrees in the text that are homeomorphic 
to the pattern are circled by the dashed lines. 
The left homeomorphic subtree has an LSH 
score of 5, while the right one has an LSH 
score of 12. 



The subtree homeomorphism problem [3, 20] is a variant of the former problem, 
where degree 2 nodes can be deleted from the text tree (see Figure 1). The 
constrained, tree inclusion problem [26] is a variant of labelled subtree homeo- 
morphism where label equality between pairs of aligned nodes in the compared 
subtrees is required. Note that all the tree matching problems mentioned so far 
have polynomial-time algorithms. The more advanced tree inclusion problem [14] 
is that of locating the smallest subtrees of T that include P , where a tree is in- 
cluded in another tree if it can be obtained from the latter by deleting nodes. 
Here, deleting a node v from a tree entails deleting all edges incident to v and 
inserting edges connecting the parent of v with the children of v. Tree inclusion 
on unordered trees is NP - complete [14]. Schlieder and Naumann [21] extended 
the tree inclusion problem to an approximate tree embedding problem in order to 
retrieve and rank search results using a tree-similarity measure whose semantics 
are tailored to XML data. The complexity of their algorithm, which is based in 
dynamic programming and processes the query tree in a bottom up fashion, is 
exponential. 

This paper addresses queries on labelled trees. The trees could be either or- 
dered or unordered, rooted or unrooted - depending on the application at hand. 
The main point is, however, that even though the trees are labelled, label equal- 
ity is not required in a match. Instead, a node-to-node similarity measure is 
used to rank the resemblance or distance between pairs of aligned nodes. This 
extension is motivated by the aforementioned applications, where a more mean- 
ingful match can be found if - in addition to the topological structure similarity 
between subtrees - node-to-node resemblance is also taken into account. For ex- 
ample, in bioinformatics, the similarity between metabolic pathways [17, 22] is 
based both on the resemblance of the elements constituting the pathways (e.g. 
proteins) as well as on the likeness of their network structure (the former would 
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reflect closeness between proteins, based on either a BLAST sequence compar- 
ison score or on functional homology, and the latter would be the topological 
equivalent of multi-source unrooted trees [19]). A query that searches for a small 
pathway fragment in a larger tree should take into account the topology similar- 
ities (in the form of subtree homeomorphism) as well as the pairwise similarity 
between proteins which make up the aligned subtree nodes. Another example 
is when designing a semantic query language for semistructured databases that 
are represented as XML documents [21]: here the node-to-node similarity score 
may reflect content or tag resemblance, and the topological difference allows 
flexibility in document structure matching [2]. Finally, in natural language pro- 
cessing, trees are often used to represent sentences, where nodes are labelled by 
words and sentential forms (or constituents); matching that takes into account 
synonyms and elisions is very useful in order to detect semantically close phrases. 

Thus, in this paper we address the challenge of extending labelled subtree 
homeomorphism into a new optimization problem, by introducing node-to-node 
similarity measures that are combined with the topological distance between the 
pattern and text to produce a single, comprehensive score expressing how close 
they are to each other. 

Let A denote a predefined node-to-node similarity score table and V denote 
a predefined (usually negative) score for deleting a node from a tree (Figure 2). 
A mapping M[T\,T 2 ] from T\ to T 2 is a partial one-to-one map from the nodes 
of Ti to the nodes of T 2 that preserves the ancestor relations of the nodes. We 
define the following similarity measure for two homeomorphic trees. 

Definition 1. Consider two labelled trees T\ and T 2 , such that T 2 is homeomor- 
phic to T\, and let Ai\T\,T 2 ] denote a node-to-node, homeomorphism-preserving 
mapping from T\ to T 2 . The Labelled Subtree Homeomorphism Similarity 
Score of M[Ti,T 2 ], denoted LSH(M[Ti,T 2 ]), is 

LSH(M[T u T 2 ])=Vx (|T 2 |-|Ti|)+ ^ A[u,v\ 

(u,u)GM 



Correspondingly, 

Definition 2. The Approximate Labelled Subtree Homeomorphism 
(ALSH) Problem is, given two undirected labelled trees P and T, and a scoring 
table that specifies the similarity scores between the label of any node appearing 
in T and the label of any node appearing in P, as well as a predefined node 
deletion penalty, to find a homeomorphism-preserving mapping M[P, t\ from P 
to some subtree t ofT, such that LSH(Ai[P 1 t\) is maximal. 

In this paper we show how to compute optimal, bottom-up alignments between 
P and every homeomorphic subtree t of T, which maximize the LSH score be- 
tween P and t. Our algorithms are based on the close relationship between sub- 
tree homeomorphism and weighted assignments in bipartite graphs. The ALSH 
problem is recursively translated into a collection of smaller ALSH problems, 
which are solved using weighted assignment algorithms (Figure 3). (Simpler, 
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maximum bipartite perfect matching algorithms were applied in the algorithms 
for exact subtree morplrisms [3,9,13,23,26].) 

Our approach yields an 0(in 2 n/ log m+mn log n) algorithm for solving ALSH 
on unordered, unrooted trees, where ra and n are the number of vertices in P 
and T, respectively. Note that the time complexity of the exact subtree isomor- 
phism/homeomorphism algorithms [23,26], which do not take into account the 
node-to-node scores, is 0(m 15 n/ log m). Thus, the enrichment of the model with 
the node similarity information only increases the time complexity by half an 
order. We also give an 0(mn) algorithm for the problem on rooted ordered trees. 

Also note that a related set of problems, where dynamic programming is 
combined with weighted bipartite matching, is that of finding the maximum 
agreement subtree and the maximum refinement subtree of a set of trees [12, 
24] . Such algorithms utilize the special constraints imposed by the properties of 
evolutionary trees (internal nodes contain less information than leaves, similarity 
assumption allows for scaling, etc). 

The rest of the paper is organized as follows. Section 2 includes weighted 
bipartite matching preliminaries. In Section 3 we describe the basic 0(m 2 n + 
?nnlogn) time ALSH algorithm for rooted unordered trees and show how to 
extend it to unrooted unordered trees without increasing the time complexity. 
These solutions are further improved in Section 4 to yield an 0(m 2 n/ log m + 
ran log n) solution for both rooted and unrooted unordered trees. In Section 5 
we describe the 0{mn) time ALSH algorithm for rooted ordered trees. 

Note that due to lack of space, most proofs are omitted. The proofs will be 
given in the journal paper. 



2 Weighted Assignments and Min-Cost Max Flows 

Definition 3. Let G = (V = X U Y, E) be a bipartite graph with edge weights 
w(x,y) for all edges (x,y) £ E, where x £ X and y £ Y. The Assignment 
Problem is to compute a matching M, i.e. a list of monogamic pairs (x £ X, y £ 
Y), such that the size of M is maximum among all the matchings in G, and 
w(x,y ) is maximum among all the matchings with maximum size. 

(x,y)£M 

Although researchers have developed several different algorithms for the assign- 
ment problem [1], many of these algorithms share common features. The succes- 
sive shortest path algorithm for the minimum cost max flow problem appears 
to lie at the heart of many assignment algorithms [11]. The reduction from an 
assignment problem to a minimum cost max flow problem is as follows. Let s, t 
be two new vertices. Construct a graph G' with vertex set V U {s,t}, source s, 
sink t, and capacity-one edges: an edge (s,x) of cost zero for every a: £ A, an 
edge (y, t) of cost zero for every y £ Y, and and edge (x, y) of cost —w(x, y) 
for every (x, y) £ E. An integral flow / on G' defines a matching on G of size 
|/| and weight —cost(f) given by the set of edges {x,y} such that (x,y) has 
flow one. Conversely, a matching M on G defines a flow of value \M\ and cost 
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— w(x,y ). This means that we can solve a matching problem on G by 

{x ,y)£M 

solving a flow problem on G' . 

Edmonds and Karp’s algorithm [5] finds a minimum cost maximum flow 
in G' in 0(EV \ogV) time. In each stage of this algorithm, Dijkstra’s algo- 
rithm [4] is used to compute an augmentation path for the existing flow /. Each 
run of Dijkstra’s algorithm is guaranteed to increase the size of M by 1, and 
thus the algorithm requires at total of 0(V ) phases to find a maximum score 
match. Fredman and Tarjan [7] developed a new heap implementation, called Fi- 
bonacci heap, that improves the running time of Edmond and Karp’s algorithm 
to 0{VE + V 2 logV). The latter bound is the best available strongly polyno- 
mial time bound (the running time is polynomially bounded in only V and E, 
independent of the edge costs) for solving the assignment problem. 

Under the assumption that the input costs are integers in the range 
[-C, . . . ,C\, Gabow and Tarjan [8] used cost-scaling and blocking flow tech- 
niques to obtain an 0(V 1 / 2 E log(UC)) time algorithm for the assignment prob- 
lem. Two-plrase algorithms with the same running time appeared in [10,18]. 
The scaling algorithms are conditioned by the similarity assumption (i.e. the 
assumption that C = 0(n k ) for some constant k) and have the integrality of 
cost function restriction. 

The following lemma, which was first stated and proven in [11], will serve as 
the basis for the proofs of Lemma 2 and Lemma 4. 

Lemma 1 ([5, 11,25]). A flow f has minimum cost iff its residual graph R 
has no negative cost cycle. 

3 Dynamic Programming Algorithms for ALSH 

3.1 The Basic Algorithm for Rooted Unordered Trees 

We use d(v) to denote the number of neighbors of a node v in an unrooted tree, 
and c(v) to denote the number of children of a node v in a rooted tree. 

Let T r = ( Vt,Et,t ) be the text tree which is rooted in r, and P r = 
(Vp, Ep,r') be the pattern tree which is rooted in r' . Let p r tl denote a sub- 
tree of P r which is rooted in node u of P r , and denote a subtree of T r which 
is rooted in node v of T r . 

We define RScores[v € Vt,u £ Vp] as follows. 

Definition 4. For each node v £ Vp and each node u € Vp, RScores[v,u\ is the 
maximal LSH similarity score between a subtree p 0 of P r and a corresponding 
homeomorphic subtree oft r v , if such exists. Otherwise, RScores[v,u\ is —oo. 

The first algorithm, denoted Algorithm 1, computes the values RScores[v,u\ 
recursively, in a postorder traversal of T r . First, RScores[v,u\ are computed for 
all leaf nodes of T r and P r . Next, RScores[v,u ] are computed for each node 
v £ Vt and u £ Vp, based on the values of the previously computed scores 
for all children of v and u as follows. Let u be a node of P r with children 
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Fig. 3. The work done by the first algorithm during the computation of RScores[v,u], 
The bipartite graph is constructed in order to compute the optimal weighted assignment 
between the children of u and those of v. Wij marks the optimal LSH similarity score 
for aligning the subtrees rooted at yj £ T and Xi £ P. 



x\ , . . . , x c ( u ) and v be a node of T r with children y 1 , . . . , y c (v) ■ After computing 
RScores[yj,Xi] for i = 1 ,...,c(u) and j = 1, . . . , c(v), a bipartite graph G is 
constructed with bipartition X and Y, where X is the set of children of u, Y is 
the set of children of v , and each node in X is connected to each node in Y. An 
edge ( Xi,yj ) in G is annotated with weight RScores[yj,Xi\ (Figure 3). 

RScores[v,u } is then computed as the maximum between the following two 
terms: 

1. The node-to-node similarity value A[v, zt], plus the sum of the weights of the 
matched edges in the maximal assignment over G. Note that this term is 
only computed if c(u) < c(v). 

2. The weight RScores[yj , u\ for the comparison of u and the best scoring child 
yj of v , updated with the penalty for deleting v. 

Upon completion, the optimal LSH similarity score, denoted bestscore, 
where best score = max RScores [yj , x m ] , is found, and every node y 3 £ T with 

3 = i 

RScoreslyj^Xm] = best score is reported as a root of a subtree of T which bears 
maximal similarity to P under the LSH measure. 

Time Complexity Analysis 

Theorem 1. 

1. Algorithm 1 computes the optimal ALSH solution for two rooted unordered 
trees in 0(m 2 n + mn log n) time. 

2. Under the similarity assumption, Algorithm 1 yields an 0(m 15 nlog(nC)) 
time complexity. 

Proof. The time complexity analysis of Algorithm 1 is based in the following 
observation. 

m n 

Observation 1. c{u) = m — 1, and c(v) = n — 1. 

U— 1 V=1 

Algorithm 1 computes a score once for each node pair (v £ T,u £ P) . The 
dominant part of this computation is spent in obtaining the weighted assignment 
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between a bipartite graph with c(v)+c(u) nodes and c(v ) x c(u) edges. Therefore, 
based on Fredman and Tarjan’s algorithm [7], each assignment computation 
takes 0(c(u)(c(u ) x c(v) + c(v) log c(v))) = 0(c(u) 2 x c(v) +c(u ) x c(i>) log c(t>)) 
time. 

Summing up the work spent on assignment computations over all node pairs, 
we get 



£( C ( U ) 2 C (u) + c ( u ) c ( v ) lo gc(w)) obser = tIon 1 

U= 1 V=1 

m 

r / \2 / \ -i \ observation 1^/9 i \ 

(J(>c(u) n + c(u)nlogn) = (J(m n + mnlogn). 

U— 1 

Under the (similarity) assumption that all scores assigned to the edges of G 
are integers in the range [-C, . . . ,C\, where C = 0(n k ) for some constant k, 
Algorithm 1 can be modified to run in time 0(m 1 ' 5 nlog(nC')) by employing the 
algorithm of Gabow and Tar j an [8] for weighted bipartite matching with scaling. 

□ 



3.2 Extending the Algorithm to Unrooted Unordered Trees 

Let T = ( Vt,Et ) and P = ( Vp,Ep ) be two unrooted trees. The ALSH be- 
tween P and T could be computed in a naive manner as follows. Select an 
arbitrary node r of T to get the rooted tree T r . Next, for each u £ P compute 
rooted ALSH between P u and T r . This method will yield an O (m 3 n+ m 2 n log n) 
strongly-polynomial algorithm for ALSH on unrooted unordered trees, and an 
0(m 25 nlog(nC)) time algorithm under the similarity assumption with integral- 
ity cost function restriction. In this section we show how to reduce these bounds 
back to 0(m 2 n + mnlogn) and 0(m 15 nlog(nC)) time, correspondingly, by uti- 
lizing the decremental properties of the weighted assignment algorithm. 

The second algorithm, denoted Algorithm 2, starts by selecting a vertex r 
of T to be the designated root. T r is then traversed in postorder, and each 
internal vertex v £ T r is compared with each vertex u £ P. Let y i, . . . , y c ( v ) be 
the children of v £ T r , and let x\, . . . , ay/(u) be the neighbors of u £ P. When 
computing the score for the comparison of u and v we need to take into account 
the fact that, since P is unrooted, each of the neighbors of u may eventually 
serve as a parent of u in a considered mapping of a subtree of P , which is rooted 
in u, with a homeomorphic subtree of t r v . Suppose that node x t , which is a 
neighbor of u, serves as the parent of u in one of these subtree alignments. Since 
Xi is the chosen parent, the children set of u includes all its neighbors but Xi . 
Therefore, the computation of the similarity score for v versus u requires the 
consideration of the highest scoring weighted matching between the children of 
v, and all neighbors of u except Xi . Since each neighbor Xi of it, * = 1, ... , d(u), 
is a potential parent of a subtree rooted in u, our algorithm will actually need 
to compute a series of weighted assignments for the graphs Gi = (Xi U Y. Ei), 
* = !,..., d(u), where Y consists of the children of v in T r , while A, consists of 
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all neighbors of u except X{. Therefore, we define UScores[ v £ Vt,u £ Vp,Xi £ 
neighbors (w)] as follows. 

Definition 5. For each vertex v £ T r , each vertex u £ P, and every vertex 
Xi £ neighbors (u) , UScores[v,u, Xi\ is the maximal LSH similarity score between 
a subtree p^f of P and a corresponding homeomorphic subtree oft„, if one exists. 
Otherwise, UScores[v,u,Xi ] is set to — oo. 

For the graphs Gi defined above, the weight of an edge ( Xj,yj >) in Gi is 
UScores [ yy , Xj , u) . 

The computation of UScores[v, u, xf is carried out as follows: UScores[v, u, xf\ 
is set to the maximum between the following two terms: 

1. The node-to-node similarity value A[v , u], plus the sum of the weights of the 
matched edges in the maximal weighted assignment over Gi. Note that this 
term is computed only if d(u) — 1 < c( v). 

2. The weight UScores[yj,u,Xi\ of the comparison of u and the best scoring 
child ijj of v, updated with the penalty for deleting v. 

Since each node u £ P is also a potential root for P in the sought alignment, 
an additional entry UScores[v, u, (f>\ is computed, which stores the maximal LSH 
similarity score between a subtree of P u and a corresponding homeomorphic 
subtree of t r v , if one exists. UScores[v, u, <f>\ is computed from the weighted as- 
signment for the graph G = ( X U Y,E), where X is the set of neighbors of 
u. 

Upon completion, the maximal LSH similarity score bestscore = 

max UScores[v,u, <f>\ is computed, and every node pair (j £ T,i £ P) with 
v£Vt ,u£Vp 

US cores [j,i,<p\ = best-score is reported as a subtree tf of T r which bears maxi- 
mal similarity to tree P l under the LSH measure. 

Time Complexity Analysis 

The time complexity bottleneck of Algorithm 2 is based in the need to compute, 
for each comparison of a pair u £ P and v £ T, weighted assignments for a 
series of bipartite graphs Gi = (Xi U Y, E) for i = 1, . . . , d(u). This, in contrast 
to Algorithm 1, where only one weighted assignment is computed for each com- 
parison of a pair u £ P and v £ T. However, the next lemma shows that the 
decremental nature of weighted bipartite matching makes it possible to compute 
assignments for the full series of Gi graphs in the same time complexity as the 
assignment computation for the single graph G. In the following, let k = |X| 
and t = \Y\. Note that k — 1 < l. 

Lemma 2. Let G be the bipartite graph constructed by Algorithm 2 for some pair 
of vertices v £ Vt and u £ Vp. Given an optimal assignment of G, the optimal 
assignment of Gi can be computed via one run of a single-source shortest-path 
computation on a subgraph of G. 

Proof. The proof is based on the reduction of the assignment problem to min-cost 
max-flow, as described in Section 2. We will show that after having computed the 
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Fig. 4. The computation of the optimal assignment for Xi U Y . Note that, for the sake 
of graphical clarity, some of the edges of the graph are not shown in the figure. (A) 
The residual graph R for flow /. The dashed lines denote the backward edges and the 
solid lines denote the forward edges. The white circle denotes vertex Xi and the lined 
circle denotes its matched neighbor yi. (B) The residual graph R' received from R by 
removing Xi and its adjacent edges, as well as reversing the flow on edge (s,J/i). (C) 
The augmentation path p in R' is marked by grey edges. (D) The negative cost cycle 
c is marked by curves. Edges [a,b,c,g] correspond to the “correction path” p, edges 
[g,d,e,f] correspond to the negative cost cycle c, and edges [a,b,c,d,e,f] mark the path 
p' with a cost which is less then that of p. 



min-cost max-flow of G', which defines the optimal assignment for G, the optimal 
assignment for Gj can be obtained via one more augmenting path computation. 

Let / denote the minimum cost maximum flow on G' , defining the the optimal 
weighted assignment M, and let R denote the residual graph for / (Figure 4. A). 

Since / is optimal, by Lemma 1, there are no negative cost cycles in R. 
Also, since £ > k, \M\ = k. Let yi be the matched neighbor of in M, i.e., 
(xi, yi) £ M. Let M' denote the matching M\ ( Xi , yi), i.e. the matching obtained 
by removing the pair ( Xi , yi) from M . We say that yi is “newly widowed” in M' . 
Clearly, \M'\ = k — 1. Let f be the flow defined by M' . Let G" denote the 
subgraph of G' obtained by removing vertex Xi and all edges adjacent to Xi in 
G' . Let R' be the “decremented” residual subgraph of R, obtained by removing 
vertex Xi and all edges adjacent to Xi in R, as well as cancelling the flow on edge 
( s,yi ) in R. (Figure 4.B). Our objective is to compute a min-cost max-flow (of 
value k — 1) of G" , which will correspondingly define the optimal assignment of 
size k — 1 on Gj. 
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Clearly, the value of f is k — 1. But is f min-cost? By Lemma 1, if there are 
no negative cost cycles in R\ then /' is a min-cost flow among all k — 1- valued 
flows in G" . 

Otherwise, there are one or more negative cost cycles in R' . Each one of these 
negative cost cycles in R' must begin with edge (s, yi), since this edge is the only 
edge of R' which did not exist in R, and R had no negative cost cycles. 

Among all such negative cost cycles in R 1 , all of which start with edge (s, yi), 
we wish to find the one which contributes the most to decrementing the value 
of /'. Therefore, let p denote the “correction path” for R', defined as the 
the min-cost path in R' among all paths which originate in s, pass through 
the newly widowed vertex yi and end back in s (Figure 4.C). Clearly, p can 
not start from some other edge than ( s , y,) since all t — k vertices in Y which 
were unmatched in M are “out of the game” and will remain unmatched in the 
optimal assignment of size \M\ — 1 on G t . Otherwise, the assumed optimality of 
M would be contradicted. 

We claim that: 

1. If p has negative total cost, then M' is not optimal. 

2. If M' is not optimal, then p is the optimal “correction path” for R' . That is, 
the flow obtained by pushing one unit of flow in R' from s through yi and 
back to s via path p, is the minimal cost maximum value flow for network 
graph G" and defines, respectively, the optimal assignment for Gi. 

The above claims are proven as follows. 

1. This is immediate from Lemma 1, since p is by definition a cycle in R' . 

2. Let /" be obtained from /' by correcting along p, and let R" denote the 
residual graph for flow /". We will prove that /" is min-cost in G" , by 
proving that there are no negative cost cycles in R" . Suppose there was 
some negative cost cycle c in R" . Since R had no negative cost cycles, and 
the only new edge {s,yi) € R' has been saturated by p in R" (therefore 
all previous cycles are broken in R"), we know that c has at least one edge 
which is the reversal of an edge in p. Therefore, cycle c consists of edges in 
R' and one or more edges whose reversals are on p. Therefore, p and c share 
at least one vertex. 

Let p ® c be the set of edges on p or on c except for those occurring on p and 
reversed on c. The cost of p ® c is cost(p) + cost(c), since the © operation 
has removed pairs of edges and their reversals (the score is negated in the 
reversal) in pUc. Since c is a negative cost cycle, cost(p) + cost(c) < cost{p). 
Furthermore, p © c can be partitioned into a path p' originating in s, going 
through y t and ending in s, plus a collection of cycles. (Note that none of 
these cycles includes edge ( s,yi ), since this edge is already occupied by p'). 
Clearly, all the cycles must have nonnegative cost since they consist of edges 
only from R' which has no negative cost cycles by Lemma 1. Thus path p' 
is a “correction path” of cost less than p (Figure 4.D). This contradiction 
implies the lemma. 
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Note that the above proof assumes k < £, in which case \M\ = k. However, the 
special case k = £ + 1 can be easily handled by adding a dummy vertex to Y 
and connecting it by edges to all the vertices in X. The weight of these edges 
should be set to N for some very large number N. □ 

Lemma 3. 

1 . Computing the weighted assignments for the bipartite graphs G\ , . . . , Gk can 
be done in 0(k 2 £ + k£ log£) time. 

2. Under the similarity assumption, computing the weighted assignments for 
the bipartite graphs Gi, . . . , Gk can be done in 0(k 15 £log(£C)) time [12]. 

Theorem 2. 

1. Algorithm 2 computes the optimal ALSH solution for two unrooted unordered 
trees in 0(mrn + mnlogn) time. 

2. Under the similarity assumption, Algorithm 2 yields an 0(m 15 nlog(nC)) 
time complexity. 

Proof. We use the following simple observation: 

m 

Observation 2. The sum of vertex degrees in an unrooted tree P is d(u) = 

14=1 

2m — 2. 

Algorithm 2 computes a score for each node pair (v € T,u £ P). During this 
procedure, the algorithm computes weighted assignments of several bipartite 
graphs. Due to the decremental relationship between these graphs, it is shown 
in Lemma 3 that computing the maximal assignment for all graphs in the series 
can be done in the same time bound taken by computing a single assignment. 
This computation, which is the dominant part of the procedure’s time, can be 
done in 0(d(u) 2 c(v) + d(u)c(v) log c(u)) time. Therefore, the total complexity is 

m n 

X](d(«) 2 c(v)) + d ( u ) c ( v ) iogcO)) obser ^ tIon 1 

14=1 U=1 

m 

d(u) 2 n + d(u)n\ogn) obse ™ tlon 2 o(m 2 n + mnlogn). 

14=1 

Under the (similarity) assumption that all scores assigned to the edges of G are 
integers in the range [— C, . . . , C], where C = 0(n k ) for some constant k, the 
algorithm can be modified to run in time 0(m 15 nlog(nC)) by employing the 
algorithm of Gabow and Tarjan [8] for weighted bipartite matching with scaling 
and [12] for cavity matching. □ 

Note that the first term of the sum 0(m 2 n + mnlogn) dominates the time 
complexity of Algorithm 2, since the second term only takes over when m < log n. 
Therefore, in the next section we will show how to reduce the time complexity 
of the first, dominant term in the time complexity of Algorithm 2. 
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4 A More Efficient ALSH Algorithm for Unordered Trees 

In this section we show how the dominant term in the time complexity of the 
algorithms described in the previous section can be reduced by a log m factor, 
assuming a constant-sized label alphabet. We will use the previous algorithm 
but solve the maximum matching problems more efficiently by employing the 
notion of clique partition of a bipartite graph [6,23]. The modified algorithm, 
denoted Algorithm 3, is the same as Algorithm 2 with the exception that, in 
step 12, we solve the assignment problem differently. Let v denote some vertex 
in T r whose children are Y = yi, , y c ( v ) and let u denote a vertex in P whose 
neighbors are X = x\, . . . ,Xm u )- We now attempt to compute the Assignment 
of G = (A U Y,E). We will partition the edges of G into complete bipartite 
graphs Ci, C 2 , . . . , C c i us t e rs u ■ Let Key(xi) be a vector containing the weights of 
the edges (x», y \ ),..., (xi, y c ( v ))- We do the partition in the following way: First, 
we sort the vertices of X in lexicographic order where the key of a vertex x is 
Key(x). Afterwards, we split X into sets of equal keys X 1 , X 2 , . . . , x clusterSu 
(i.e. , for any two vertices x,x' £ Xi, any edge (. x,yj ) has the same weight as 
edge ( x',yj ) for all vertices yj £ Y). 

Now, for 1 < i < clusters u we set Ci to be the subgraph induced by the 
vertices of X 1 and all their neighbors in Y . We now follow the method of [6, 23] 
and build a network G* whose vertices are V* = XUhU (ci, . . . , c c i us t e rs u > s > £}■ 
The edges are E* = E 1 UE 2 UE 3 where E\ = {[s,#*] : Xi £ X}U {[yi,t] : yi £ Y}, 
£2 = {[xi,Cj] : j < clusters u ,Xi £ Cj}, and £3 = {[cj,yi] : j < clusters u }. All 
edges have capacity 1. Edges from sets E\ and £2 are assigned a cost of zero. 
An edge of type £3 from Cj to yi is assigned a cost which is identical to the cost 
of edge (x, yi) where i£Tis any vertex belonging to the set Cj. The source is 
s and the sink is t. We find a min-cost max flow f* in G* using Fredman and 
Tarjan’s algorithm, and construct from this flow the assignment in G. 

Lemma 4. A min-cost max flow in f* corresponds directly to a min-cost max 
flow in f. 

We denote by D(u) the number of distinct trees in the forest , . . . , P“ 

Lemma 5. The assignment between u £ P and v £ T can be computed in 
0{d{u) (D(u)c(v) + c(v) logc(u)) ) time. 

Time Complexity Analysis 

Algorithm 3 activates procedure C omputeScoresF orT extN ode once for each 
node pair (v £ T,u £ P). The dominant part of this procedure’s work is spent in 
computing the weighted assignment between a bipartite graph with c(v) + d(u) 
nodes and c(v) x d(u) edges. 

Therefore, based on Lemma 5, the total work spent on assignment computa- 
tions is 



m n 

o(£ 5>< u) (cluster s u x c(v)) + c(v) log c(v))) obser = t 

U= 1 V=1 
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m 

d(u) x clustersu x n + d(u)nlogn) 

U= 1 



observation 2 



m 

Q(n Y d(u) D(u) + mnlogri). 

U= 1 



m 

We will now turn to bound the summation 0(^^d{u) D(u)). We start with 

U= 1 

next lemma, which sets an asymptotically tight bound on the number of distinct 
labelled rooted trees in a forest of n vertices, denoted /(n). 



Lemma 6. Assuming constant label alphabet, f(n) = 0(n/ log n). 



Lemma 7. d(u) D(u) = 0{m 2 /logm). 

U— 1 

We have thus proved the following theorem. 

Theorem 3. Algorithm 3 computes the optimal ALSH solution for two unrooted 
unordered trees in 0(m 2 n/ logm + mn log n) time. 

5 Solving ALSH on Ordered Rooted Trees 

A simplified version of the first algorithm, denoted Algorithm 4, can be used to 
solve the ALSH problem on ordered rooted trees. The simplification is based on 
the fact that the assignment problems now turn into maximum weighted match- 
ing problems on ordered bipartite graphs, where no two edges are allowed to 
cross. (We refer the reader to [26] for a discussion of non-crossing plain bipartite 
matching in the context of exact ordered tree homeomorphism.) Also note that, 
in the context of our ALSH solutions, we actually apply a rectangular case of the 
perfect assignment problem on G = (XUY, E), i.e. |A| < |T| and all nodes in X 
must eventually be paired with some node in Y . Therefore, the assignment com- 
putation reduces to the Approximate Weighted Episode Matching optimization 
problem, as defined below. 

Definition 6. The Approximate Weighted Episode Matching Problem: 

Given pattern string X, a source string Y, and a character-to- character similar- 
ity table A[Ex , TV], find among all \X\-sized subsequences ofY the subsequence 

\x\ 

Q which is most similar to X, that is, the sum A[Qi,Xf\ is maximized. 

i= 1 

Lemma 8. The highest-scoring approximate episode occurrence of a pattern X 
of size k characters in a source string Y of size t characters can be computed in 
0(k x £) time. 

Proof. This can be achieved by applying the classical dynamic programming 
string alignment algorithm to a {k + 1 rows by £ + 1 columns) dynamic program- 
ming graph (see Figure 5). All horizontal edges in the graph, corresponding to 
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Fig. 5. Approximate Weighted Episode Matching Calculation. 



character deletions from Y, are assigned a score of zero. All vertical edges in 
the graph, corresponding to character deletions from X, are assigned a score of 
— oo. A diagonal edge leading into vertex (xi,yj) corresponds to the string-edit 
operation of substituting the zth character of X with the jth character of Y, 
and is therefore assigned the score A[i, j], □ 

Time Complexity Analysis 

Theorem 4. ALSH of two rooted ordered trees can be computed in 0(mn) time. 

Proof. Algorithm 4 computes the assignment once for each node pair (v £ T,u € 
P). In this section we showed that, for rooted unordered trees, the dominant work 
of the assignment computation can be narrowed down to 0(c(v) x c(u)) time. 
Therefore, the total time complexity is: 

n m n 

\ > V > s-\/ / \ f \ \ observation 1 \ , / w observation 1 \ 

Z^Z^O(c(v)xc(u)) = 2_^0(mxc(v)) = 0(mxn). □ 

V=1 U—l V=1 
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Abstract. In this paper we study the average behavior of the number 
of distinct substrings in a text of size n over an alphabet of cardinality k. 
This quantity is called the complexity index and it captures the “richness 
of the language” used in a sequence. For example, sequences with low 
complexity index contain a large number of repeated substrings and they 
eventually become periodic (e.g., tandem repeats in a DNA sequence). In 
order to identify unusually low- or high-complexity strings one needs to 
determine how far are the complexities of the strings under study from 
the average or maximum string complexity. While the maximum string 
complexity was studied quite extensively in the past, to the best of our 
knowledge there are no results concerning the average complexity. We 
first prove that for a sequence generated by a mixing model (which in- 
cludes Markov sources) the average complexity is asymptotically equal to 
n 2 / 2 which coincides with the maximum string complexity. However, for 
memoryless source we establish a more precise result, namely the average 
string complexity is n 2 /2 — nlog fc n+(l + (l— 7 )/ In k+(j>k(log k n)+o(l))n 
where 7 « 0.577 and <j>k{x) is a periodic function with a small amplitude 
for small alphabet size. 



1 Introduction 

In the last decades, several attempts have been made to capture mathematically 
the concept of “complexity” of a sequence. The notion is connected with quite 
deep mathematical properties, including the rather elusive concept of random- 
ness of a string (see, e.g., [14, 19,22]). 

In this paper, we are interested in studying a measure of complexity of a se- 
quence called the complexity index. The complexity index captures the “richness 
of the language” used in a sequence. Formally, the complexity index c(x) of a 
string x is equal to the number of distinct substrings in x (see e.g. [20]). The 
measure is simple but quite intuitive. Sequences with low complexity index con- 
tain a large number of repeated substrings and they eventually become periodic. 
However, in order to classify low complexity sequences one needs to determine 
average and maximum string complexity. In this paper we concentrate on the 
average string complexity. 

We assume that sequences are generated by some probabilistic source (e.g., 
Bernoulli, Markov, etc.). As a consequence, the number c(x) of distinct sub- 
strings can be modeled by a random variable over a discrete domain. Given a 
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source emitting strings of size n over an alphabet of cardinality k , we call this 
random variable C n ^ k - The main objective of this study is to give a detailed 
characterization of the expectation of the random variable C n 

A related notion is that of the l- subword complexity or l- spectrum c l {x ) of a 
string x, which is the number of distinct Z-mers in x, for 1 < l < \x\. We define 
C l n k to be the random variable associated with the number of distinct words 
of size l in a random string of size n over an alphabet of cardinality k. Clearly, 

r< 

n,k — Z^= 1 

The idea of using the complexity index or the /-spectrum to characterize se- 
quence statistically has a long history of applications in several fields, such as 
data compression, computational biology, data mining, computational linguis- 
tics, among others. 

In dictionary-based data compression, the average length of the pointer is 
connected with the expected size of the dictionary which in turns depends on the 
number of distinct subwords (see, e.g., [6]). Low-complexity strings contain more 
repeated substrings and therefore one can expect them to be more compressible 
than strings with high complexity index. For example, in [14] bounds between 
subword complexity and Lempel-Ziv complexity are established. 

In the analysis of biosequences, the problem of characterizing the “linguistic 
complexity” of DNA or proteins is quite old. In the early days of computational 
biology, it was almost routine to compute the number and/or the frequency of 
occurrences of distinct Z-mers and draw conclusions about the string under study 
based on those counts (see [23, 7, 15, 12, 16], just to mention a few). 

In these and several other application domains, the typical problem associ- 
ated with the complexity index is to determining whether a particular sequence 
x has a statistically significant complexity index. An example of a significance 
score proposed in a recent paper by Troyanskaya et al. [31] in the context of the 
analysis of prokaryotic genomes, is the following 

n 

s(x) = c(x) — max/CV^fe} = c(x) — min (k\ n — i + 1). 

i=l 

Here, the authors compare the observed complexity c(x) with the maximum 
possible complexity for a string of size n over an alphabet of cardinality k. 
Note however, that the score disregards both the distribution of C n .k, and the 
probabilistic characteristics of the source. 

A more statistically-sound approach would entail the following steps. First, 
select an appropriate model for the source that emitted x (Bernoulli, Markov, 
etc.). Then, measure the statistical significance as a function of the discrepancy 
between the observed complexity c(x) and the model-based expectation. 

This approach of standardizing the value of an observation with respect to 
the expectation and the standard deviation of the associated random variable is 
common practice in Statistics. The underlying assumption is that the random 
variable is normally distributed. The standardized z-score for the complexity 
index would be 

, , _ C( X ) - E (C'n.fc) 

z(x) — 

^Var (C n , k ) 
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for a given string x. Although we do not know under which conditions C n ,k is 
distributed normally, such a score is nonetheless more sound that other ad hoc 
scores. 

A similar situation takes place when describing the significance of other 
events in texts, like the number of occurrences, periodicities, etc. Although the 
normality of the corresponding random variables can be proved under specific 
conditions, there is a consensus that standardized z-scores should be always 
preferred over simpler scores (see, e.g., [25] and references therein). 

Given an observation x of the source, we would like to compute the statistical 
significance z(x) of its complexity index. As far as we know, however, the mo- 
ments E(C' n ,fe) and Var(C'„ j fe) have never been characterized before. The goal of 
this paper is to study E(C' rlj fc). The asymptotic analysis of the variance remains 
an open problem. 

In order to proceed, we need to introduce some standard concepts and no- 
tation about strings. The set £ denotes a nonempty alphabet of symbols and a 
string over £ is an ordered sequence of symbols from the alphabet. In the rest 
of the paper, we assume that \£\ = k. Given a string x, the number of symbols 
in x defines the length |x| of x. Henceforth, we assume |x| = n. 

The concatenation (or product) of two strings x and y is denoted by xy, and 
corresponds to the operation of appending y to the last symbol of x. Let us 
decompose a text x in uvw, i.e. , x = uvw where u,v and w are strings over £. 
Strings u, v and w are called substrings, or words , of x. 

We write xuu 1 < i < |a;| to indicate the i-tli symbol in x. We use xuj] 
shorthand for the substring x^xy + y ■■■Xy\ where 1 < i < j < |x|, with the 
convention that xu : p = xuy Substrings in the form xn, u corresponds to the 
prefixes of x, and substrings in the form xu tU ] to the suffixes of x. 

Finally, we recall that the subword complexity function c l (x) of a string x 
is defined as the number of distinct substrings of x of length l. The quantity 
c(x ) = Ym= i c \ x ) is the complexity index of x. Observe first that c l {x) is upper 
bounded by min(fc , n — l + 1) since there are precisely n — l+l words of length 
l, of which at most k l can be distinct. Therefore 

n 

c(x) < min(fc i , n — l + 1). 
l = l 

Note that if we choose n < k, then min (k l ,n — l + 1) = n — l + 1 for all 1 < l < n. 
Therefore when n < k, the bound simplifies to c(x) < n(n + l)/2. 

The value c(x) is strongly connected with the structure of the non-compact 
suffix trie for x. A non-compact suffix trie of a string £ is a digital search tree 
built from all the suffixes of x. The trie for a string of size n has n + 1 leaves, 
numbered 1 to n + 1, where leaf n + 1 correspond to an extra unique symbol 
$ fL £, and each edge is labeled with a symbol in £. No two edges outgoing from 
a node can have the same label. The trie has the property that for any leaf i, 
the concatenation of the labels on the path from the root the the leaf i spells 
out exactly the suffix of x that starts at position i, that is x^^y The substrings 
of x can be obtained by spelling out the words from the root to any internal 
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Fig. 1. The non-compact suffix trie T x for x = abaababa. There are 24 internal non-root 
nodes, therefore c(x) = 24. 



node of the tree. In other words, each internal node (except the root) is in one- 
to-one correspondence with each distinct substring of x. As a consequence, the 
complexity index c(x) can be obtained by counting the non-root internal nodes 
in the non-compact suffix trie for x. This would take, however, 0(n 2 ) time and 
space. The non-compact suffix trie for abaababa is illustrated in Figure 1. 

A faster solution to compute c( x) involves the use of the suffix tree T x of x. 
The suffix tree can obtained by “path-compression” of the non-compact suffix 
trie, that is, by deleting internal nodes with only one child and coalescing con- 
secutive edges in a single edge labeled by the concatenation of the symbols. If 
one deletes unary nodes only at the bottom of the non-compact suffix tries, the 
resulting tree is called compact suffix trie. The compact suffix trie and the suffix 
tree for abaababa are shown in Figure 2. 



aba ababa$ 




a ba ababa$ 




Fig. 2. The compact suffix trie (LEFT) and the suffix tree (RIGHT) for x = abaababa. 
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In practice, suffix trees can be build without the need of building the suffix 
trie first. In fact, several 0(?rlog |I7|) constructions are available. The algorithm 
by McCreight [21] and the one by Chen and Seiferas [9] are variation of the 
Weiner’s algorithm [34]. Note that these algorithms take only linear time for 
finite alphabets. All these constructions are off-line because they process the 
text right to left. An on-line algorithm by Ukkonen [32] achieves also linear time. 
Recently, Faraclr [11] proposed an optimal construction for large alphabets. 

The unary nodes which have been removed in the compaction process are 
called implicit nodes. An edge labeled by a string of length m + 1 has m implicit 
nodes. The complexity index c(x ) of a text-string x can be computed on the 
suffix tree by counting the number of implicit and explicit (non-root) nodes in 
the tree. As a consequence, the c(x) can be computed in 0(n ) time and space. 
The relation between non-compact suffix tries and suffix trees will be used later 
in paper to obtain the leading term of the complexity for a general probabilistic 
model. 

Finally, we briefly describe some recent results regarding the maximum of 
c l (x). It is known that c l ( x) is also strongly connected with the structure of the 
suffix tree T x . De Luca [10] proved that 

max{cr (a;) : 1 < l < n} = n — max{R, K} + 1 = n — max{L, if} + 1 



where K is the length of the shortest suffix of x that occurs only once, if is the 
length of the shortest prefix of x that occurs only once, R — 1 is the height of 
the deepest branching node in the suffix tree T x and L — 1 is the height of the 
deepest branching node in the suffix tree T x r. De Luca also gave a closed form 
for c(x) 



c(x) = 1 + 



(n + K)(n 
2 



K+ 1) 



1^1 R 

3 — 2 *=0 



where g(j, i) is the count of the words of length i which are branching nodes of 
the suffix tree with at least j children [10]. 

Shallit [26] derived a simpler bound for c(x) for binary alphabets (k = 2) 



c( x) < 



(n — d + l)(n — d) 



+ 2 d+1 - 1 



n 

T 



where d is the unique integer such that 2 d + d — 1 < n < 2 d+1 + d. More 
importantly, he proved that the upper bound is attained for all n by using a 
property of the de Bruijn graphs. An extension of this result to larger alphabets 
was recently described in [13]. 

Kasa [17] studied the probability distribution of random variable associated 
with the complexity index for the family of words of length equal to the size of 
the alphabet [n = k). He proved several facts about the random variable Ck,k , 
and he also conjectured a property of the smallest value of the complexity after 
which all the frequencies are non-zero. The conjecture, proved later by Leve and 
Seebolcl [18], states that if one chooses k = l ^ 1 '> +2 +i where l > 2 and 0 < i < l 
then 
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P (C k , k = t) > 0 for all t such that ^ ^ + 31 + 2 + i(l + 1) < t < ^ + — 1 

In this paper we mostly deal with the average string complexity. First, for a 
general probabilistic model (e.g., Markov source) we prove that the average com- 
plexity is asymptotically equal to n 2 / 2 which coincides with the maximum string 
complexity. We shall strengthen this result for unbiased memoryless sources. In 
particular, we prove that 

E(C„, fc ) = (”2 X ) - nl °Sfc n + (^ + + </>fc( lo gfc n)^n + 0(\/nAogn) 

where 7 « 0.577 and <f> k ( x) is a continuous function with period 1 and small 
amplitude for small alphabet size (e.g., \(j) 2 (x)\ < 2 • 10 -7 ). To prove this result 
we use the Stein-Chen method together with the Mellin transform approach. 

2 Main Results 

As a warm-up exercise, we studied the closed forms for E (C n ^) and Var ( C n ik ) 
for short strings (e.g., n < 5) for a symmetric memoryless source. Some facts 
about P (C nt k = t ) are immediate. For example 

P (Cn.fc < n) = 0 
P (C n , k =n) = fc 1 -" 

P ^Cn.fe > ^ = 0 when n < k 

P {^Cn,k > miii(/c', n — i + 1)^ = 0 when n > k 

Following the lines by Kasa [17], we were able to obtain closed form for the 
cases shown in Table 1 assuming a symmetric Bernoulli model for the source. 
Given the discrete distribution of C n , k one can easily compute the expectation 
and the variance of C n tk . 

Corollary 1. The expectation and the variance of the random variable C n ^ k for 
2 < n < 5 over any alphabet of size k, under a symmetric Bernoulli source is 

E (C 2 , k ) = 3 - (1/k) 

E (C 3 ,/c) = 6 — (3/k) 

E (C 4 , k ) = 10 - (6/k) + (1/k 2 ) - (1/fc 3 ) 

E (C 5 , k ) = 15 - (10/fc) + (4/A: 2 ) - (6/fc 3 ) + (2/fc 4 ) 

Var (C 2 , k ) = (k- l)/k 2 
Var (C 3 , k ) = 3 (k - 1 )/k 2 

Var ((74,*) = (k- 1)(6A: 4 - 5A; 3 + 12A; 2 - k + 1 )/k 6 
Var ( C 5tk ) = 2(k - 1)(5 k e - 10fc 5 + 33fc 4 - 28 fc 3 + 16A: 2 - lOJfc + 2)/k 8 



and 
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Table 1 . The probability distribution of C TO ,fc for 2 < n < 5 over any alphabet of size 
k, under a symmetric Bernoulli source. 




As it turns out, obtaining a closed form for E (C n ,fe) and Var (C n ,fc) for any 
n,k is a very challenging problem. In practical applications, moreover, having the 
closed form only for small values of n would be of limited interest. It is certainly 
more valuable to study the behavior of the moments of C Ut k asymptotically, that 
is, when n is very large. 

The main result of our study is a characterization of E (C n ,k) for large n. 
In our first result we show that for quite general sources the average complex- 
ity asymptotically coincides with the maximal string complexity. We consider 
mixing sources in which the probability of two events A and B defined on two 
substrings separated by g symbols is bounded as follows: (1 — tp(g))P(A)P(B) < 
P (AB) < (1 + t/:(5 , ))P(A)P(f3) where the mixing coefficient V’(ff) - ’ ► 0 as g — > oo 
(cf. [28] for a detailed definition). 

Theorem 1. Let C n ^k be the complexity index of a string of length n generated 
by a strongly mixing stationary source. Then, for large n, 

E C n ,k = ^ ^ ~ O(nlogn). 

Hence C n ,k = n 2 / 2 + O p (nlogn), i.e. (n 2 / 2 — C n ^)/n\ogn is bounded in prob- 
ability. 

Proof. We start with a simple observation. For a given sequence x of size n, build 
a non-compact suffix trie and a compact suffix trie. Recall that in a compact 
trie we collapse all unary links at the bottom of the suffix trie. In other words, 
in a compact suffix trie a path from the root to an external node (containing 
a suffix) is the minimal prefix of any two suffixes that distinguishes them. Also 
recall that the string complexity of x is the number of non-root internal nodes 
in the non-compact trie. We shall argue below that the most contribution to the 
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string complexity comes from the nodes that are in the non-compact trie but 
not in the compact trie. The upper bound follows immediately from 

C n ,k < n(n + l)/ 2 . 

To find a matching lower bound, we consider the compact and non-compact 
suffix tries. We know that a typical depth and height in a compact suffix trie is 
O(logn). More precisely let H n be the height of a compact suffix tree. It was 
shown in [27] that (at least for if>(g) satisfying ’^2 g>0 if 2 (g) < °°) H n /lnn — > 
2/h\ a.s., where hi is the Renyi’s entropy (cf. [28, p. 157]). More precisely, the 
proof shows (for any ip(g) — > 0 ) that for any e > 0 

P [h h < ^-(l + e)logn^ = 1 - 0 (logn/n e ). ( 1 ) 

We claim that the main contribution to C n< k comes from strings that are in the 
non-compact trie but not in the compact trie. In fact, the i-th suffix string has 
n — i internal nodes of which at least n — i — H n are not in the compact trie. 
These nodes all correspond to unique substrings of x, and thus 

n 

C n ,k > _ ^ Hn) = \n(n + 1) - nH n . 

i— 1 

By (1), for a suitable constant B and large n, P (H n > Blogn) < n -1 and thus 
E^ln(n + 1) — Cn’k'j < n~EH n < n[B log n + nP ( H n > B log n)) = O(nlogn), 
which completes the proof. 

However, from theoretical point of view the most interesting case is when the 
string is generated by an unbiased source (such a source should have the largest 
complexity). In this case, we are able to characterize very precisely the average 
complexity. 

Theorem 2. Let C n ,k be the complexity index of a string generated by an un- 
biased memoryless source. Then the average l-subword complexity is 

E (C l nik ) = k\ 1 - e~ nk ~ l ) + 0(1) + 0(nlk~ l ). (2) 

Furthermore, for large n the average complexity index becomes 

E (C n ,k) = ^2 ^ -nl °gfc n + + <j> k (\og k nf) n + Oiy/nlogn) 

where 7 « 0.577 is Euler’s constant and 

«*>-! hz'h-gi)**- 

0 

is a continuous function with period 1 . \4>k(x)\ is very small for small k: \<f> 2 (x)\ < 
2 • 10- 7 , \(t> 3 (x)\ < 5 • 10" 5 , |^4 (a) | < 3 • IQ" 4 . 
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Interestingly enough, the term O(n) of the average complexity contains a 
fluctuating term </>k(x ). (Note that </> 2 (log 2 n) equals — Po(n) in [28, p. 359].) 
The formula 

\r(-l-iy)\ = \r(-iy)/(-l-iy)\= (y(l + y 2 )smh(7ry)/n) (3) 

[28, p. 42] shows that the coefficients in (f>k are small and decrease geometrically. 
Numerical evaluations give the bounds for k = 2, 3,4 stated in the theorem and 
also, for example, \(f>k(x)\ < 0.01 for fc < 12 and \cj)k(x)\ < 0.1 for k < 200. Even 
for very large k, this term is not very large; we have, still using (3) (we omit 
the details), \(j>k{x)\ < 0.5 for k < 10 9 , and |^fc(a;)| < lnln(fc)/7r for all k. The 
fluctuating phenomenon is quite common in asymptotics of discrete problems. 



3 Proof of Theorem 2 

In this section we prove Theorem 2, however, to simplify our notation we restrict 
ourselves to the binary case k = 2. Extension to k > 2 is straightforward. 

Recall that C l n 2 is the number of distinct substrings of length l in our random 
binary string of size n, and let 



A,=E(C£ i2 ). 

Thus C n> 2 = EHi Ct 2 and E(<7„, 2 ) = i Define 

Si = Ai + l- l-2 l (l-e~ n2 ~ l ) (4) 

= A t - (n + 1 - 0 + 2 l ( e~ n2 ~ l - 1 + n2~ l ) . ( 5 ) 

Then 

n n 

E (C n , 2 ) = E = E (( n +l~l) + Si-2 l {e~ n2 ~ l - 1 + ,12-')) 

1=1 1=1 

( i -I \ n n 

n 2 )-E 2 '( e_ " 2 " i - 1 + n2_ 0+E' 5 '- ( 6 ) 

' i=i i=i 

Below we will use several times the estimates (for x > 0) 0 < 1 — e~ x < min(l, x) 
and 0 < e~ x — l + x< min(a;, x 2 ). In particular, 

0<2*(1 — e-" 2_i ) <min(2 l ,n), (7) 

0 < 2'(e _ " 2_! — 1 + n2~ l ) < n 2 2~ l . (8) 

We begin by estimating Si. First (for short strings, i.e., for small l), we use 
1 < C l n 2 < 2 l , and thus 0 < Ai < 2 l and, using (4) and (7), 



Si = 0 ( 2 '). 



(9) 
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For long substrings, i.e., for large Vs, there is another simple estimate. Clearly, 
0 < C l n 2 < n — l + 1. Observe that 



E(n - l + 1 - C l n 2 ) < E (| {(i,j) '■ i < j and } |) 

— l + P 



2~ l < n 2 2~ l 



i.e. 

0<n~l + l~Ai< n 2 2~ l . 

Hence, using (5) and (8), 

Si = 0{n 2 2~ l ). (10) 

Note that (9) and (10) easily yield Y^a=i h = O(n), which by (6) yields 
E(CVi, 2) up to 0(n). In order to obtain our more precise result, we need a better 
estimate of Si when 2 

Let i be a sequence generated by an i.i.d. source (i.e., p(0) = p(l) = 1/2) 
and let us define 



P (n,l) = P(a?[i,j] 7^ i] for j = 2 , . . . ,n - l + 1) 

By counting each repeated substring in x only the last time it occurs, it is easily 
seen (by shift invariance) that 



A, = £P(m,Z) (11) 

m—l 

Now fix l and to, and let us define Ij = l[aJ[i,/] = i] ] for j = 2, . . . , m—l+l. 

Then 



E(/j) = p {Ij = 1) = P[aJ[i,j] = x\j,j+l-l]\ = 2 1 for every j > 2. 

In the next lemma we establish that Z; and Ij are uncorrelated when i,j > l. 

Lemma 1. If i,j >1 + 1 and i ^ j then E (Iilj) = P(a’[! ;] = i] = 

x \j,j+l- 1]) = 2 

Proof. Assume i < j. Scan the three substring left to right. Each bit in 
is either a fresh random bit or coincides with an earlier bit in ■ In any 

case, it fixes one new bit each in Xun and xuj + i_i i, so the probability of success 
in this step is 2~ 2 . 

Observe that, if j > l+l, to condition on Ij is the same as to change every bit 
in xtjj+i- 1] to the corresponding bit in xnn, leaving all other bits untouched. Let 
x^ be the resulting string, and let Jij = = x^p i+l _^]. Clearly, Jjj = Ii 

if I* — j I > l- Note also that Lemma 1 yields, when i j and i.j > l, 

E (Jij) = E(Z i | Ij = 1) = = 2~ l . 
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Let us now set, for fixed integers l and m > 21, 

m - l -\- 1 

E 7 ^- 

3=1 + 1 

Let cItv{X, Y) be the total variation distance between random variables X and 
Y . Then, with Po(A) being the Poisson distribution with mean A, by the Stein- 
Chen method (cf. [5, page 24]) we find, with A = E W = (to — 21 + l)2 _i , 



d TV (W,Po((m 2l + l)2~ 1 )) < — M)^ e(Jj)e 

3 

<^E E ^)( E ( 7 i)+ E (E(/i)+E(J i: 

3 0<|i-j|<2 



Ij + E Jij) 

i+3 



2 — 1 — 1 



< - V 21 - 2-2 

A 

3=1 + 1 

= Al2~ l . 



-21 



In particular, 

|p(W = 0) - e -( m - 2i + i)2- 



<d T v(W,Po((in-2l + l)2- 1 )) =0(l2~ l ). (12) 



Moreover, by the first moment method 



P (E 7 ^ 0 ) <(i-l)E(/ j ) = (/-l)2"‘. 
C= 2 



(13) 



Observe that 



P(m, 0 = P |^E !j+ W = °J • 

Then by (13) and (12) 

P(m, Z) = P(W = 0) + 0(l2- l ) = e~ {rn - 2l+1)2 ~ l + 0(l2~ l ) = e - {m - l)2 ~ l + 0{l2~ l ). 

(14) 

We have assumed to > 21. However, by the first moment method directly, the 
same estimate holds for l < to < 21 too. 

We thus have, by (11) and (14) and summing a geometric series, 

1 _ g-(n+l -l)2~ l 



+ = E P ( m > 



m=l 



+ 0(nl2~ l ). 
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Since 

1 _ e -(n+l-l)2-‘ X _ e -(n+l-l)2-‘ 

1 - e~ 2 ~ l " 2~\l + 0(2- 1 )) 

= (2 / + 0(l))(l-e-( n+1 -^ 2_! ) 

= 2 l (l - e -(«+i- 02 - ! ) + o{n2~ l ) 

= 2 l (l - e~ n2 ~‘) + 0(1) + 0(n2~ l ), 



we find 

Ai = 2'(1 - e~ n2 ~ l ) + 0(1) + 0(nl2~ l ) 
which proves (2). Thus by (4) 

5i = 0(l) + 0(nl2~ l ). (15) 

Using (9) for 1 < l < log 2 V n In n, (15) for log 2 V n In n < l < 21og 2 n, and 
(10) for 2 log 2 n < l < n, we obtain 

n 

Y^ Si = O(nlogn) 1 ^ 2 . (16) 

l 



We turn to the first sum in (6). Let, for x > 0, 

,, , e~ x - 1 + xl[x < 1] 

= x ■ 

Then \f(x)\ < x for 0 < x < 1 and \f(x)\ < 1/x for x > 1. Hence, 

n n 

Y 2 l (e~ n2 ~ l - 1 + n2~ l ) = n Y (/0 2 ”') + 1 [2* < n]) 

1 = 1 1=1 

n 

= nYf( n2 ~ l ) + n L lo S2 n \ 

1 = 1 



oo 

= n Yj f( n2 ~ l ) + 0(1) + n[log 2 nj 

1 = — OO 

= nip(n) + n log 2 n + 0(1), 
where, with (x) = x — , the fractional part of x, 

OO 

i>(x) = Y f( x2 ~ l ) - ( l °g 2 x ), x>0. 

l =— oo 



(17) 



(The series converges by the estimates above.) It is easily verified that if) is 
bounded, continuous (also at powers of 2), and periodic in log 2 x, i.e., ijj(2x) = 
ip(x). Hence ip( 2 V ) has period 1 and may be expanded in a Fourier series 

OO 

v>(2«) = y c ^ iuv ( 18 ) 



v = — OO 
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where 



C„ = [ e~ 2 ™ y i>(2 y )dy 

Jo 

pi OO 

= / e~ 2 ™y( J2 f(2 y ~ l )-y)dy 

0 l— — OO 

OO pi pi 

= V / e- 2 ™ y f(2«- l )dy- / e~ 2 ™y y dy 

l=-oo J ° Jo 

/ OO pi 

e“ 2 ™yf(2 v )dy- / e- 2 ™ y ydy. 

-oo J 0 



( 19 ) 



Further, changing variables back to (0,oo), 

/ °° 1 /*°° Ary 1 

J 2 ^yf(^)dy= — J x- 2 ™/1“ 2 /W 7 = — Ai[/(*);-27r* I //ln2], 

(20) 



where A4[f(x); s] = / 0 °° a: s_1 /(a;) da; is the Mellin transform of / at the point s 
(in our case s = —2niu/ln2. (See [28, Chapter 9] for definition and basic prop- 
erties.) Since \.f(x)\ < min (a:, a: -1 ), the Mellin transform A i[f(x)\z] is analytic 
in the strip — 1 < $lz < 1. In the smaller strip 0 < 5iz < 1 we have, by [28, Table 
9.1], 



M[f{x)\z] =m( 6 x +M( 1[x < 1 ];z) 

= M(e~ x +M(l[x < l]-,z) (21) 

= r(z-i) + 

z 

By analyticity, this extends to the strip — 1 < ( ftz < 1. In particular, taking the 
limit as z — > 0, 



A<[/(x),0]= lim(lg±^ + 1) = jim r(1 + z 8) _ / + ^ = -(r'(l) + 1) = 7~ I- 



z(z — 1) 2 

Moreover, elementary integration yields 
[ e -2ni V y ydy _j 



— 1/2-Kiv, v 0, 

1 / 2 , u = 0. 



By (19)-(23), 



_ 1 r ( i 2mv\ 

v ~\n2 r \ In 2 / 
7-1 



v ^0, 



co = 



In 2 



- 1 / 2 . 



(22) 

( 23 ) 
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The theorem now follows, with <f>(x) = —(ip(2 x ) — Co), from (6), (16), (17) and 

(18). 

The numerical bounds for k < 4 are obtained from (3); for small k, \c v \ 
is dominated by |ci| which is very small. 
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Abstract. The point set pattern matching problem is, given two sets 
“pattern” and “text” of points in Euclidean space, to find a linear trans- 
formation that maps the pattern to a subset of the text. We intro- 
duce an approximate point set pattern matching for axis-sorted point 
sequences that allows a translation, space insertions and deletions be- 
tween points. We present an approximate pattern matching algorithm 
that runs with pattern size n and text size m in 0(nm 2 ) time in general, 
and in 0(nm) time if distances between two contiguous points in texts 
and patterns are finite. A variant of the four- Russian technique achieving 
0(nm/ log n + n log n) time is also provided. Furthermore, as a natural 
extension we present an approximate point set pattern matching on the 
plane, and give a polynomial-time algorithm that solves this problem. 

Keywords: point set pattern matching, edit distance, approximate 
matching, musical sequence search 



1 Introduction 

The point set pattern matching problem by linear transformations asks, given 
a pair of a “pattern” set and a “text” set of points in Euclidean d-dimensional 
space, to find a linear transformation that exactly maps the pattern to a subset of 
the text. A rich class of problems, such as a motion detection of rigid bodies from 
images, relate to this fundamental problem. In [14], Rezende and Lee presented 
an exact point set pattern matching algorithm for the 1-dimensional case, and 
extended it to d-dimensional one with a circular scan technique that runs in 
0(nm d ) time with the pattern size n and the text size m. 

The idea of approximate matching is also a powerful tool for the point set 
pattern matching in practical applications. For example, in a two-dimensional 
gel electrophoresis analysis, we would like to determine whether a set of spe- 
cific proteins, which is observed as a set of spots on a gel media plane, occurs 
among spots obtained from a sample material. Distances between spots depend 
on experimental conditions, which drift during an experiment. Hoffmann et al. 
[4], and Akutsu et al. [1] coped with this problem by approximate point set 
pattern matching schemes and heuristic algorithms. Unfortunately, their edit 
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distances by non-uniform transformations lead the matching problems to be 
NP-complete [1, 5] even for the 1-dimensional case. 

Makinen and Ukkonen have discussed in [8, 11] a 1-dimensional approximate 
point pattern matching, which is an extension of the approximate string pat- 
tern matching, and applied it in a heuristic for the gel electrophoresis prob- 
lem. Their algorithm runs in 0(nm) time with space insertions and deletions 
between points. Additionally substitutions for sets of points are handled in 
0(nm 3 + n 2 m 2 + n 3 m) time. Also in [8] they have introduced an approximate 
2-dimensional point matching problem, and have been proved that the problem 
is NP-harcl. 

The origin of our motivation is a melody search problem for digital musical 
scores such as internet contents, ringer melodies of mobile phones, streaming data 
for on-demand “Karaoke” systems, etc. So far, pattern matching and similarity 
measurement for musical score sequences have often been done by extending text 
search techniques (e.g. [6,13]). However, string representations are not suitable 
in musical pattern search. For example, the same melodies in different keys are 
described as completely different strings. Thus in [13] the keys of scores are 
assumed to be normalized in advance. Kadota et al. [6] have prevented this 
by generating all possible transposes of the target passage. In recent works [7, 
10], this difficulty is treated by string pattern matching algorithms that allow 
symbols in patterns to transpose, i.e. relative shifts for all symbols, by regarding 
symbols as numbers. 

A digital musical score such as a MIDI sequence can be considered as a series 
of notes, where each note roughly represents a note-on time, duration and a 
note-number describing a pitch in half-steps. Then a score simply describes a set 
of points in a plane, which is like a cylinder of musical boxes or a piano-roll of 
a classic player piano [12]. In this representation, key transposes are naturally 
treated as exact translations in the pitch axis. Differences in tempo would be 
treated by a scaling factor along the time axis. Since a pattern provided by 
human may lack some notes, such as notes forming chords, we would like to find 
a pattern as a subset of a score. To attain these advantages, we formulate as in 
[15] a musical pattern matching as a point “set” pattern matching problem. 

In this paper, we define an approximate point set pattern matching for axis- 
sorted point sequences that allows space insertions and deletions between points, 
and show a DP algorithm that runs in 0(nm 2 ) time and 0(nm) space with pat- 
tern size n and text size m. Then, we show an improvement of the DP algorithm, 
and prove that it achieves 0(nm) running time if point sequences have “finite 
resolution,” i.e., the maximum distance between two contiguous points is fixed. 
It enables us to apply the algorithm for a huge size of text sequences. Also, 
this assumption allows us to apply ideas in approximate string matching: If 
point sequences have finite resolution, a tabular computation can be performed 
in 0(mn/ log n + nlogn) time by a speed-up technique analogous to the four- 
Russian technique [2,3]. 

Furthermore, we introduce an edit distance and an approximate point set 
pattern matching for point sets in the plane, as a natural extension of our ap- 
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proximate point set pattern matching in sequences. Then a DP-like algorithm 
that runs in 0(n 2 m 4 ) time and 0{n 2 m 2 ) space is devised. To our knowledge, this 
is the first non-trivial 2-dimensional approximate point set matching providing 
a polynomial-time algorithm. 

2 Point Sequences and Approximate Matching 

As in [8], the 1-dimensional case is instructive and interesting itself. Furthermore, 
there are considerable amount of practical applications where points belong to 
two or higher dimensional space but are ordered by one dominating axis. For ex- 
ample, in musical sequence search, notes of scores must be treated as time-series 
sequences by “note-on time.” In 3-D matching of protein molecule structure, a 
molecule is often abstracted as a folded chain of amino acids, and thus the order 
of amino acid residues is fixed as a sequence. 

Also we assume that in a 1-D point sequence there are no two or more points 
that have the same value. This does not spoil our discussion since in such a case 
points have additional dimensions and thus we can choose an appropriate point 
by other attributes. The notions required for our problem can be formalized as 
follows. 

A 1-dimensional point sequence, or point sequence in short, is a finite sequence 
of strictly-increasing positive integers. For a point sequence p, the notion \p\ 
refers to the number of points. A point sequence p' = (p^, ... ,p' r ) is a subsequence 
of p = (jp\, . . . , Pn) if there is a subsequence 1 < fci < • • • < k r < n of indices 
such that p\ = pk, for all 1 < * < r. 

Let p = (pi, . . . ,p n ) and t = (ti , . . . , t m ) be point sequences with n < m. We 
say p matches t if there is a subsequence t' = (tk 1 , . . . , tk r ) of t such that, for all 
1 < i < n, pi — pi - 1 = tki — tk i _ 1 ■ In this case, p and t are said to be a pattern 
and a text, respectively. Also we say that p occurs in t at k\, and the index k\ 
is an occurrence position of p in t. 

This definition of matching is due to the point set matching by [14]. We 
extend this to an approximate version. Firstly, we introduce an edit distance 
between sequences having the same number of points. 

Definition 1. Edit Distance between Point Sequences 

Let p and q be point sequences of the same size n. The edit distance d(p, q) 
between p and q is d(p , q) = 0 if n < 1, and 

n 

d(p,q) = ^2 I Pi Pi—i ~ ( hi - hi- 1 )| • 

i- 2 

For the brevity of discussion, we define d{p, q’) = oo for |p| \q'\. 

This edit distance does not satisfy the “identity” condition with the straight- 
forward equivalence. It is due to the lack of the distance |gi — pi\ between the 
first points. Anyway, the following fact supports that our formulation is mathe- 
matically well-defined. 
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For any point sequences p, q and r of size n, the edit distance satisfies 

(a) d(p,q) > 0, 

(b) d(p,q) = d(q,p), and 

(c) d(p , q) < d(p , r) + d(r, q) for any point sequence r. 

Now the criteria to find a subsequence of a text sequence that approximately 
matches to a pattern sequence is formalized as follows. 

Definition 2. Approximate Point Set Pattern Matching 

Let p = (pi, . . . ,p n ) be a pattern point sequence, t = (t i, . . . ,t m ) a text point 
sequence, and e > 0 a positive integer. We say p approximately matches t at an 
index k within error e if there is a subsequence t! of t such that t' begins with tk 
and d{p,t') is no more than e. 

Unlike the matching defined in [8] , we must find a best subset of points from 
the given text of points. Fortunately, we can use recurrence relations to find 
a subsequence, and it allows us to apply a version of the “traditional” tabular 
computation method. In the following, we show the recurrence that holds for the 
minimum edit distances between a pattern point sequence and a subsequence of 
a text point sequences. 

Let p be a pattern point sequence of size n, and t a text point sequence of 
size m. For all 1 < * < n and 1 < j < to, we define the notion D(i,j) as the 
minimum value of the edit distances between (pi, . . . ,pi) and any subsequence 
t! of t such that \t'\ = i and t! ends with the point tj. Then clearly D(l,j) = 0 
for all 1 < j < m, and by the definition of the edit distance D(i,j) = oo for 
i > j. Furthermore, the following recurrence holds. 

Lemma 1. For all 1 <i < n and 1 < j < m, 

D(i,j) = min {D(i - 1, k) + |p* - p,_i - (tj - t k )\} . 
i—l<k<j 

Proof. It is clear that the left hand side is no greater than the right hand side. 
Also by assuming a sequence 1 < h\ < ■ ■ • < hi- \ < j, it is immediate that 

D(i,j) - | pi -pi-i - (tj - t hi _ J| > D(i - l,hj_i) 

and i — 1 < hi- 1 . □ 

This gives a straightforward DP algorithm. Given a pattern p = (pi, . . . ,p n ) 
and a text t = (t\, . . . , t m ), we construct an n x m integer table D. According to 
the definition, all the cells in the first row of D are initialized to 0, and all D(i,j) 
with i > j are filled with oo. Then the remaining cells are filled as follows. 

Algorithm.!.. 

1. For i from 2 to n do the following: 

(a) For j from i to m do the following: 

i- D (i,j)= min {D(i - 1, j') + |p; - p*_i - (tj - 
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For each cell D{i,j), the algorithm looks up j — i + 1 cells in the (i — l)tlr row 
with 0(m) time, and thus the total tabular computation takes 0{nmr) time and 
0{nm) space. To obtain an occurrence, or a subsequence of the text, we invoke 
the “trace back” procedure which is exactly the same with that of approximate 
string matching (e.g., [3].) Enumeration of occurrences takes 0{m ) time. 

Corollary 1. 1-dimensional approximate point set pattern matching can he 
solved in 0{nm 2 ) time and 0(nm ) space with pattern size n and text size m. 

3 Speed-Up for Practical Instances 

Running time of the DP-algorithm presented in the previous section is polyno- 
mial but is quadratic with the size of text. To cope with this point, we utilize 
the following observations for practical instances in a particular application: 
The resolution of positions of points in sequences are limited by accuracy 
of measuring devices, data format, human sense, etc. For example, in a MIDI 
sequence, a time delay (delta time) of a note from the previous one is often spec- 
ified by a resolution 480 per quarter note. In gel electrophoresis technique, the 
size of gel media planes are limited, and positions of spots cannot be accurately 
determined since they appear as stains. 

We summarize these observations and formalize as follows. 

Definition 3. Let L be a positive integer, and let C be a (possibly infinite) set 
of point sequences. We say C has finite resolution L if there is no two contiguous 
points in any sequence inC such that the distance between them is greater than L. 

In the following, we present Algorithm_2, an improved version of Algorithm_l. 
The modification eliminates redundant computations at Step l-(a)-i of Algo- 
rithm_l which searches for the minimum value, though the worst case time- 
complexity order remains same for general instance. However, if text and pattern 
have finite resolution, it becomes 0{nm) time algorithm. 

Algorithm_2. 

1. For i from 2 to n do the following: 

(a) k <— i — 1; /* The “pivot” index of the ztlr row. */ 

(b) D _ <— oo; /* The minimum value in the range f £ [i — 1, k — 1]. */ 

(c) For j from i to m do the following: 

i. £>_ <- D- + (■ tj - tj- 1) if j ^ i\ 

ii. D + <— oo; 

iii. For j' from k to j do the following: 

A. If pi — Pi-i < tj — tj i then 

k <— j 1 + 1; /* Advancing the pivot. */ 

D - <- min {D-,D{i - 1,/) + | p t - Pi-i - {tj - tj>)\}] 

B. else 

D+ <- min {D+,D(i - 1 , /) + \pi - Pi-i - {tj - tj>)\}\ 

iv. D{i, j) <— min{Z)_,D + }; 

At first, we show this algorithm is equivalent to Algorithm.!. 
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Lemma 2. Algorithm-2 completes the tabular computation. 



Proof. Recall the recurrence relation of cells discussed in the previous section. 
By the notion A(i,j,j ') = pi — pi-i — ( tj — tj /), it can be rewritten as follows. 



D(i,j) = min 



“P. , {D(i- l,f) + \A(i,j,j')\}, 



The idea of the algorithm is that the values considered in the right-hand side 
of this equation (except for f = j — 1) are obtained by adding or subtracting 
tj — tj - 1 to the values calculated in the computation of D(i,j — 1). Assume 
that indices i and j are fixed. Since point sequences are strictly increasing, 
A(i,j,j') monotonically increases with f . Thus there is the “pivot” index k 
with i — 1 < k < j such that A(i,j,j') < 0 for i — 1 < j' < k and A(i,j,j') > 0 
for k < j' < j. 

Let us consider the difference between D(i—l,j — l) and D(i— 1, j). Let k(j_ o 
and k( :/) be the pivots for A(i — 1, j — l,j') and A(i — l,j,j'), respectively. Note 
that k(j- 1 ) < k(j) because i is fixed. Since A(i,j—l,j') < 0 for i—1 < j' < ku_ i), 
we can rewrite 

D(i - 1,/) + \A(i,j,f)\ = D(i - 1,/) + | A(i,j - l,j')\ + (tj - tj-!) 

for all? — 1 < j' < This indicates that if D(i — 1,/) + | A(i,j — 1, j')\ is 

minimum for some j' < fc(j-i) then D(i — 1, j 1 ) + \A(i, j, j')\ is also the minimum 
in the same range. This value is stored in D-. For k(j-u < f < fcy), the sign of 
A(i,j,j') changes from plus to minus, and the minimum value is compared with 
and stored in D-. This is carried out in Step l-(c)-iii-A, as well as updates of the 
value k towards k^y For the remaining range k^ < f < j , we enumerate values 
as in Algorithm_l. The result of Step l-(c) is the minimum value among all the 
indices i — 1 < j’ < j. Thus the algorithm computes the same recurrences. □ 

Theorem 1. Let t and p be a text and a pattern point sequences, and let r be 
the ratio of the maximum distance between two contiguous points in p to the 
minimum distance between two contiguous points in t. Then, Algorithm-2 runs 
in 0(nm • r) time with n = |t| and m = \p\. 

Proof. The steps from l-(c)-i to l-(c)-iv of the algorithm takes 0(j — k) time. 
Let L be the maximum distance between two contiguous points in p. For any 
interval I C [t\,t n ] of length 2 L, the number of points of t in I is at most 2 r. 
Then the number of points in [tk,tj] for some j and pivot k, at the beginning 
of Step l-(c)-iii, is 0(r). In another words, k at Step l-(c)-iii advances with the 
control index j in the loop at Step l-(c). □ 

Corollary 2. Algorithm-2 solves 1-D approximate point set pattern matching 
in 0(nm 2 ) time in general, and solves in 0(mn) time if a text and a pattern are 
chosen from a set having finite resolution. 
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3.1 Computational Results 

We briefly show by computational experiments that Algoritlrm_2 achieves dras- 
tic improvement. Table 1 shows execution time required to finish tabular com- 
putations by Algorithm_l and Algorithm_2 for sequences of note-on time 
extracted from Standard MIDI Format (SMF) files on the internet. The algo- 
rithms are implemented by GNU C++ with STL, and tested on AMD Duron 
800 MHz (« Intel Celeron 800 MHz) based machine. A text with 4370 points is 
formed from a concatenation of multiple tracks in one SMF file, and texts with 
* symbol are synthesized by concatenating tracks from multiple files of tunes to 
obtain huge inputs. The maximum distance between two contiguous points in 
patterns A and B is 480. 

Execution time by Algorithm_2 seems to be proportional to text size. The 
maximum number of iterations at Step l-(c)-iii in Algorithm_2 was 57 for these 
sequences. 



Table 1. Execution time for tabular computations with patterns formed from 10 and 
20 points. 



Text 

(points) 


Pattern A (n = 10) 


Pattern B (n = 20) 


Algorithm_l Algorithm_2 


Algorithm_l Algorithm_2 


4370 

20309* 

51173* 

104579* 

1154073* 


7.505 (sec.) 0.019 (sec.) 

226.8 0.048 

1489 0.107 

0.201 
2.238 


22.81 (sec.) 0.039 (sec.) 

573.0 0.094 

3692 0.220 

0.411 
4.577 



3.2 Four Russian Speed Up 

The four-Russian speed up is a well-known technique (e.g. [3]) to finish a n x m 
tabular computation in 0(nm/ log 2 n+?rlog 2 n) time with unit-cost RAM model. 
The technique divides a whole table into boundary-shared squares of a fixed size, 
and only their boundaries are filled by looking up values from a pre-computed 
table. The key is that the value of a cell depends on only 3 cells and two character 
symbols, and thus if an error bound is limited then the number of possible 
combinations of symbols and values is constant. 

In general, this does not hold with point sequences, and applying the four- 
Russian technique to our approximate point set pattern matching is impossible. 
However, if point sequences have finite resolution, then a variant of the four- 
Russian speed up for the tabular computation in Algorithm_2 is available. 

We define rectangular blocks of cells for filling up the DP-table as follows. 



Definition 4. Let C be a set of point sequences having finite resolution L, and 
let p = (pi,...,p n ) and t = (ti, . . . ,tm) be a pattern and a text sequence of 
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Fig. 1. Segments of a block on D and D_, and the corresponding subsequences. 



points in C. We choose constants du > L and dw > 4L appropriately. For any 
1 < i < n and 1 < j < m satisfying Pi + dn < p n and tj + dw < t m , we define 
rectangle block(i,j) of cells of D (and D_) represented by the set of pairs of cell 
indices 



block (i,j) = {(*', j') | Pi < Pi / < Pi + d H ,tj < tj> < tj + d w } ■ 

The number of cells in a block depends on its position. We refer the number 
of rows and the number of columns of block (i,j) to r, and Cj, respectively. 
Also we define, for any S £ Z, the notion plus (i, 6) of the text index such that 
tpius(i,< 5) < ti + S and plus(i , 5) is maximum. 

We place the first block, block(ii.ji ), at i\ = 1 and j i = n. Then, block(ik, ji), 
the ifcth from top and the j;th from left, is placed at ik = ik-i + Ti fc _ 1 — 1, ji = 
plus{ji- 1 , dw — 2dn) + l- By this placement, the first row and the ( plus(j , 2 dn) — 
j + 1) left-aligned columns of block(i,j ), the area U (See Figure 1,) overlaps with 
the upper and the left neighbor blocks. Also, cells in the area Z' are shared with 
the right neighbor block and the lower neighbor block. 

To determine values of cells in Z' and cells of D_ in Y, we only need values 
of cells in [/, subsequences ( pi , . . . ,pi +n _i) and (tj. ... . tj+ Cj - 1 ) of p and £, and 
the cells in D_ corresponding to the area X. We encode the subsequences of 
points by sequences of differences from the previous points, as V = (0,Pi+i — 
Pij • • • iPi+ri — 1 Pi+ri — 2 ) and W — (0 , tjj-i tj . . . . tj- (- 0 ^— 2 )- 

Now we give our algorithm. Firstly, we directly fill the cells (l,j) with 1 < 
j < m and (*, j) with i > j of tables D and Z}_, and compute the cells (i,j) with 
i < j < n and with n< j < plus(n , 2 dn). Secondly, we place blocks as presented 
above, as many as possible. At each placement, we fill the cells in Z' by looking 
up the pre-computed table that maps a combination of values of ( U , V, W, X ) to 
that of ( Z\Y ). The cell values larger than possible error e are omitted as the 
positive infinity symbol oo. Then, all the values in the last row of the DP-table 
D are determined. 

The bottleneck of this algorithm is time complexity of preparing pre-computed 
table. For any block, the number of rows is at most dn, and the number of 
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columns Cj is at most dw- Thus U has at most dw + 2dij(dH — 1) cells, and X 
has at most dn cells. A value of a cell is either 0, 1, . . . , e, oo, or the “no cell” 
symbol. Therefore, the number of possible combinations of values of U and X 
is (e + tyd-w+zdnidH-i) . (g -|- 3) dH_1 . Also, since sequences have finite resolution 
L , there are L dil + + • — | - L < (L + l) dH possible combinations of values 

for V', and ( L + \) dw possible combinations for W' . For each combination of 
quadruple ( U , V 7 , W' , A), its result (Z, Y) can be computed within time 0(2dn • 
(dw — 2 dn) ■ (dn — 1))- By summarizing this, the total time complexity for 

preparing pre-computed table is O (d w ■ d H 2 ■ (e + 3) dw+2da 2 • (L + l) d ^+ d ^ . 
Therefore, the following holds. 

Theorem 2. If an error £ and dn > L can be regarded or chosen as constants, 
then for sufficiently large n and m, we can choose dw = log( L+1 w £+3 ) n > 4 dn 
and the pre- computation table can be prepared in 0(n log n) time. 

Roughly, at most jj^ ■ blocks are placed in each table. By supposing a 
unit-cost RAM model that can look up the pre-computed table in 0(1) time, we 
obtain the following result. 

Corollary 3. If an error bound e is regarded as a constant, and a pattern and a 
text are sufficiently large and drawn from a set C of point sequences having finite 
resolution, then a tabular computation of Algorithm-2 can be done in 0(ff^ + 
nlogn) time with unit-cost RAM model. 

4 Approximate Point Set Pattern Matching on the Plane 

In this section, we introduce an edit distance and an approximate point set pat- 
tern matching on the plane, which is an extension from those of our approximate 
matching of point sequences. Then, we present a polynomial-time algorithm for 
the problem. To our knowledge, this is the first polynomial-time solvable ap- 
proximate point set pattern matching on the plane. 

A point in the plane is a pair p = ( p.x,p.y ) of positive integers. We assume 
that, for any set P C Z +2 of points in the plane, any pair of points p,p' £ P 
satisfies neither p.x = p' .x nor p.y = p'.y. Also, we assume that a set P of points 
has an arbitrary but fixed order of points and can be regarded as a sequence 
P = (pi, . . . ,p n ) of points. 

Let P C Z+ be a set of points. We denote the indices of the right most 
point and the top point in P by (P)r and (P)t- The bounding rectangle of P 
is the minimum axis-parallel rectangle that includes all the points in P. For any 
1 < i,j < n, we define the subset 

P[i,j } = P[j,i\ = {Pk £ P | Pk-x < ma,x(pi.x,pj.x) and p k .y < ma x(pi.y,pj.y)}. 

If the origin set of points is clear in a context, we identify P[i,j\ with its bounding 
rectangle and the pair ((P[i,j])R, (P[i, j])t) of indices, and simply denote it by 
\i,j]. The sets produced by removing the right-most point P(u,j]) R and the top 
point P([i,j]) T from P[i,j\ are denoted by [i,j\- and [i, j]~ , respectively. 

Firstly, we define the edit distance between two sets of points. 
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Fig. 2. The left-bottom aligned bounding rectangle of x points of P define axis-parallel 
space insertions and deletions (the first three ones, numbered in right) for matching to 
o points of Q. 



Definition 5. Let P and Q be sets of n points in the plane. For any 1 < 
j, fc, Z < n, the edit distance d{i.,j\k,l ) between P[i,j] and Q[i,j] is 

1- if\P[i,j} \ = \Q[k,l] \ = 1 then d{i,j\k,l ) = 0, 

2. if \P[i, j}\ ^ \Q[k,l]\ then d(i, j\k,l) = oo, and 

3. otherwise, d{i,j\k,l) = 

mini = ) + ~P([ij]-)R -(?([fc,z])* - 

5 [^7^] ) \P([ij])T P(.[^J]~)t (*?([&, Z])t ^([fc,/] — )t ) I / 

where the distance \ ■ \ between points denotes L\ norm on the plane. 

This definition is naturally extending the edit distance of 1-dimensional point 
sequences to that of bi-axis sorted sequences. Figure 2 is exhibiting this idea. 
Firstly, the left-bottom points x of a set P and o of a set Q are aligned, and are 
treated as bounding rectangles having only one point. This is the base case 1 
of Definition 5. Then, each bounding rectangle is enhanced to have exactly one 
more point. It is implied by the case 2 of the definition. Among all possible pairs 
of such left-bottom aligned bounding rectangles, we choose a pair that minimizes 
the distance between the new added points. This defines the case 3 of the above 
definition. Note that the edit distance between P and Q of size n is always finite. 

The matching is defined as follows. 

Definition 6. Let P = {pi, . . . ,p n } and T = {t \, . . . , t m } be sets of points in 
the plane, and let e > 0 be a positive integer. If there is a subset T' C T such 
that the edit distance d((P)n, (P)t', (T')r, ( T')t ) between P and T' is no more 
than e, then we say that P approximately matches T at the position (T')r. 

Let P be a set of n points. We denote by [i,j]<i and [i, j]- 1 with 1 < l < n 
the sets of the first l points of P[i,j] in A-axis increasing order and the first l 
points in K-axis increasing order, respectively. 

Let P and T be sets of n and m points in the plane. For any 1 < i, j < n and 
any 1 < k, l < to, we denote by D(i,j ; k, l) the minimum edit distance between 
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P[i,j ] and any subset T' C T[k,l\ that includes tk and ti, i.e. T' C T such that 
{T')r = (T[k,l])R and (T')t = (T[/c,/])t- Then the base conditions are defined 
as follows. 



1. If |P[i,j]| = 1 then D(i,j;k,l ) = 0. 

2. If \T[k, Z]| < |P[i,j]| then D(i,j;k,l ) = oo. 

3. If|P[i,j]| = 2 and ([fc, 1])r ^ ([fc, 1])t then D(i,j; k, l) is given by the edit dis- 
tance between P[i,j } and {t([ fcj ;]) H , z]) T }, and thus D(i,j- k,l) = \ P{[i,j]) R ~ 

P([i,j]-)R ~ (%,*])* - t ([k,i]) T ) I = I P([i,j]) T - P{[i,j]-) T - (%M )t ~ *([M)JI- 

In other cases, the following lemma holds. 

Lemma 3. For |P[i, j\\ = n > 2, the following recurrences hold. 

(!) If{[k,l])R = ([M])t, then D(i,j;k,l) = 



min 

\P[i,j]\-l<w<\T[k,l]\ 



D([i,j}--, [k,l]< w ) 

+ \P([i,j]) R -P([i,j]-) R 
[k, l]- w ) 

+ 1 P{\i,i])T - P{[i,i]~)T 



(2) If{[k,l}) r ^ QM)t, then D(i,j;k,l ) = 



(*([*,«])« “ *([M<™)h)I’ 

(t([k,i]) T - i([fe,;]<») T )| 



min 





' D([i,j]-\ ([MkuOfl, ([M])t)+ 1 1 


min < 


ft 

1 

s 

1 

a 

jg 


\T[([k,l])T,m})T}\<w<\T[k,l]\ 


, (*([M)fl “ *(M<»)h)I J l 






min < 


\P([ij]) T -P([i,j]-)R~ > 


|T[([M)H,([M)H]l<™<|r[MI 


. (*([M])t “*(M< w )t)I J , 



Proof. Similar to the proof of lemma 1. 



Now we present an algorithm that finds occurrences at which P approxi- 
mately matches T. Firstly, by enumerating pairs of indices 1 < i,j < n, we 
prepare the list List(P,r) = ((ii,ji), . . . , ( i n (P,r)ijn(P,r ))) of all possible bound- 
ing rectangles of P with r points for 1 < r < n, where n(P, r) denotes the 
number of such rectangles in P. Also, we build List(T, z ) for T and 1 < z < m, 
and denote the number of size z rectangles by n(T, z). It takes 0(n 2 + m 2 ) time. 
Next, we construct a map from a quadruple (■ i,j,k,l ) with 1 < i,j < n and 
1 < k, l < m to D(i,j;k,l), by the base conditions and the following tabular 
computation. 

For r from 2 to n do: 

For z from r to m do: 

For each (i,j) in List(P,r) and for each (k,l) in List(T,z ) do: 
Compute D(i,j ; k,l) by the recurrences. 

The number of quadruples ( i,j , k, l ) is 0(n 2 m 2 ), and for each cell if we can look 
up the table D in constant time a computation for the recurrence requires 0(m 2 ) 
time. Finally, we find and construct all or one of occurrences by a back-tracking 
procedure. 
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Theorem 3. The point set pattern matching on the plane with pattern size n 
and text size m is solvable in 0(n 2 m 4 ) time and 0(n 2 m 2 ) space. 

5 Concluding Remarks 

In musical score sequence search, it is quite needed to regard differences of tempo 
between a pattern sequence and a text sequence. In exact point set pattern 
matching [14], global scaling is also considered in linear transformations. How- 
ever, time insertions and deletions make impossible to applying the method same 
with [14] . An efficient algorithm that can deal with edit distance and global scal- 
ing is the primary open problem. 

The technique utilized in Algoritlrm_2 that separates the difference value 
by its sign relates to “path-separability” in [9]. It is extensively used 
with data structures for minimum range queries on sequences to develop im- 
proved algorithms for edit distances with restrictions on gaps. Improving the 
running time of our algorithm by this scheme for instances without finite reso- 
lution, and its practical implementation should be our future work. 
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Abstract. Given a matrix X composed of symbols, a bicluster is a 
submatrix of X obtained by removing some of the rows and some of the 
columns of X in such a way that each row of what is left reads the same 
string. In this paper, we are concerned with the problem of finding the 
bicluster with the largest area in a large matrix X. The problem is first 
proved to be NP-complete. We present a fast and efficient randomized 
algorithm that discovers the largest bicluster by random projections. A 
detailed probabilistic analysis of the algorithm and an asymptotic study 
of the statistical significance of the solutions are given. We report results 
of extensive simulations on synthetic data. 



1 Introduction 

Clustering refers to the problem of finding a partition of a set of input vectors, 
such that the vectors in each subset (cluster) are “close” to oue another (accord- 
ing to some predefined distance). A common limitation to the large majority 
of clustering algorithms is their inability to perform on high dimensional spaces 
(see, e.g., [1,2]). 

Recent research has focused on the problem of finding hidden sub-structures 
in large matrices composed by thousands of high dimensional vectors (see, e.g., 
[3-10]). This problem is known as biclustering. In biclustering, one is interested 
in determining the similarity in a subset of the dimensions (subset that has to 
be determined as well). Although there exists several definitions of bicluster- 
ing, it can be informally described as the problem of finding a partition of the 
vectors and a subset of the dimensions such that the projections along those 
directions of the vectors in each cluster are close to one another. The problem 
requires to cluster the vectors and the dimensions simultaneously, thus the name 
“biclustering” . 

Biclustering has important applications in several areas, such as data mining, 
machine learning, computational biology, and pattern recognition. Data arising 
from text analysis, market-basket data analysis, web logs, etc., is usually ar- 
ranged in a contingency table or co-occurrence table, such as, a word-document 
table, a product-user table, a epu-job table or a webpage-user table. Discovering 
a large bicluster in a product-user matrix indicates, for example, which users 
share the same preferences. Finding biclusters has therefore applications in rec- 
ommender systems and collaborative filtering, identifying web communities, load 
balancing, discovering association rules, among others. 



S.C. Sahinalp et al. (Eds.): CPM 2004, LNCS 3109, pp. 102-116, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
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In computational biology, this problem is associated with the analysis of gene 
expression data obtained from microarray experiments. Gene expression data is 
typically arranged in a table with rows corresponding to genes, and columns cor- 
responding to patients, tissues, time points, etc. The classical approach to ana- 
lyze microarray data is clustering. The process of clustering partitions genes into 
mutually exclusive clusters under the assumption that genes that are involved in 
the same genetic pathway behave similarly across all the testing conditions. The 
assumption might be true when the testing conditions are associated with time 
points. However, when the testing conditions are heterogeneous, such as patients 
or tissues, the previous assumption is not appropriate anymore. One would ex- 
pect that a group of genes would exhibit similar expression patterns only in a 
subset of conditions, such as the subset of patients suffering from the same type 
of disease. Under this circumstance, biclustering becomes the alternative to the 
traditional clustering paradigm. The results of biclustering may enable one to 
discover hidden structures in gene expression data in which many genetic path- 
ways might be embedded. It might also allow one to uncover unknown genetic 
pathways, or to assign functions to unknown genes in already known genetic 
pathways. 

Biclustering is indeed, not a new problem. In fact, it is also known under 
several other names, namely “co-clustering”, “two-way clustering” and “direct 
clustering”. The problem was first introduced in the seventies in a paper by 
Hartigan [11]. Almost thirty years later, Cheng and Church [3] raised the interest 
on this problem for applications in gene expression data analysis. 

Several other researchers studied the problem recently. Wang et al. propose 
the pCluster model that is capable of discovering shifting or scaling patterns 
from raw data sets [4] . Tanay et al. [5] combine a graph-theoretic approach with 
a statistical modeling of the data to discover biclusters in large gene expression 
datasets. Ben-Dor et al. [6] introduce a new notion of a bicluster called order 
preserving submatrix , which is a group of genes whose expression level induces a 
linear ordering across a subset of the conditions. Murali and Kasif [12] (see also 
[10]) propose the concept of xmotif . [ which is defined as a subset of genes whose 
expression is simultaneously conserved for a subset of samples. 

As we were writing this document, we became aware of two other contribu- 
tions to the subject, by Sheng et al. [8], and Mislrra et al. [9], that use a ran- 
domized approach similar with the work described here. Sheng et al. [8] propose 
a randomized algorithm based on Gibbs sampling to discover large biclusters in 
gene expression data. Their model of a bicluster is probabilistic, that is, each en- 
try of the matrix is associated with a probability. Mishra et al. [9] are concerned 
with the problem of finding e-bicliques which maximizes the number of edges 1 . 
Given a bipartite graph (f7, V, E), a subgraph ([/', V') is e-biclique if each vertex 
in U' is a neighbor of at least (1 — e) fraction of vertices in V' . The authors 
give an efficient randomized algorithm that finds the largest e-biclique, but no 
experimental results are reported. 



i 



the connection between bicliques and bicluster will be explained in detail in Section 2 
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As shown in papers [12] and [8], the problem of biclustering gene expression 
data can be formulated on a discrete domain, by first discretizing the gene expres- 
sion matrix into a matrix over a finite alphabet. The simplifying assumption is 
that the set of states in which each gene operates is finite, such as up-regulated, 
down-regulated or unchanged. Once the data is discretized into strings where 
each symbol corresponds to a state, the biclustering problem reduces to the 
problem of finding a subset of the rows and a subset of the columns such that 
the submatrix induced has the property that each row reads the same string. 
Such a submatrix would therefore correspond to a group of genes that exhibit 
a coherent pattern of states over a subset of conditions. This is indeed the for- 
mulation of the problem that we define in Section 2, which is first proved to 
be NP-complete. In Section 3 we present a randomized algorithm which is effi- 
cient and easy to understand and implement. Section 4 presents an asymptotic 
analysis that allows one to determine the statistical significance of the solution. 
Finally, in Section 5 we report simulation results on synthetic data. 



2 Notations and Problem Definition 

We use standard concepts and notation about strings. The set £ denotes a 
nonempty alphabet of symbols and a string over £ is an ordered sequence of 
symbols from the alphabet. We use the variable a as a shorthand for the cardi- 
nality of the set £, that is, a = \£\. Given a string x, the number of symbols in 
x defines the length |x| of a;. 

Similarly, we can define a two-dimensional n x m string (or matrix) X £ 
jjnxm Qver ij le alphabet JJ. The element (i,j) of X is denoted by Xujy A row 
selection of size k of X is defined as the subset of the rows R = {«i , * 2 , • • • , *fc}, 
where 1 < i s < n for all 1 < s < k. Similarly, a column selection of size l of X 
is defined as a subset of the columns C = {j\ . j-z , where 1 < j t < m for 
all 1 < t < l. 

The submatrix X^ R c) induced by the pair (R, C ) is defined as the matrix 







X[n,ji] 




X[i 2 ,j2\ 






X[i k ,j 2 ] 





Given a selection of rows R, we say that a column j, 1 < j < m, is clean 
with respect to R if the symbols in the j - th column of X restricted to the rows 
i?, are identical. 

The problem addressed in this paper is defined as follows. 

Largest bicluster(/) problem 

Instance: A matrix X £ I7” xrn over ^j ie alphabet £. 

Question: Find a row selection R and a column selection C such that the 
rows of X( R ' C j are identical strings and the objective function f(X( Rt c)) is 
maximized. 
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Some examples of objective functions are the following. 

- h (*(fl,o) = \R\ + \C\ 

- h ( X (R,C)) = |-R| provided that \C\ = |i?| 

- h ( X (R,C)) = |i?l \C\ 

The problem in general may have multiple solutions which optimize the ob- 
jective function. The solutions may also “overlap”, that is, they may share some 
elements of the original matrix. 

The computational complexity of this family of problems depends on the 
objective function /. In the literature, the problem has been studied mostly from 
a graph-theoretical viewpoint which corresponds to the special case £ = {0, 1}. 
In fact, observe that a matrix X £ {0, l}" xm is the adjacency matrix of a 
bipartite graph G = (Vi, V 2 ,E) with \Vi\ = n and | V 2 I = m. An edge (i, j) € E 
connects node i £ V 1 to node j £ V 2 if X.;j = 1. Thus, a submatrix of l’s in X 
corresponds to a subgraph of G which is completely connected. Such a subgraph 
is called a biclique. Because of this relation, we use the terms “submatrix”, 
“biclique”, and “bicluster” interchangeably. 

When the alphabet is binary and we are looking for the largest submatrix 
composed only by l’s 2 , the Largest bicluster reduces to well-known prob- 
lems on bipartite graphs. More specifically, the Largest bicluster problem 
associated with objective function /1 is known as the Maximum Vertex Bi- 
clique problem, and it can be solved in polynomial time because it is equivalent 
to the maximum independent set in bipartite graphs which, in turn, can be solved 
by a minimum cut algorithm (see, e.g., [13]). The same problem with objective 
function over a binary alphabet is called Balanced Complete Bipartite 
Subgraph problem or Balanced Biclique problem and it is listed as GT24 
among the NP-complete problems in Garey & Johnson’s book [14] (see also 
[15]). 

The Largest bicluster problem with objective function fs and £ = {0, 1} 
is called Maximum Edge Biclique problem. The problem requires to find the 
biclique which has the maximum number of edges. The problem is proved to 
be NP-complete in [16] by reduction from 3Sat. The weighted version of this 
problem is shown NP-complete by Dawande et al. [17]. 

In [13] Hochbaum studies a problem related to Maximum Edge Biclique, 
which is the problem of finding the number of edges that need to be deleted so 
that the resulting graph is a biclique. Hochbaum describes a 2-approximation 
algorithm based on LP-relaxation. According to Pasechnik [18] this approxima- 
tion ratio does not hold for the original Maximum Edge Biclique problem. 
Pasechnik shows a semidefinite relaxation, and claims that his relaxation is in 
general better than [13]. 

The following theorem establishes the hardness of the problem of finding the 
largest area bicluster over a general alphabet. For lack of space the proof is 
omitted. 

2 In general, a solution of the largest bicluster can contain a column of zeros, as long 
as they appear in all rows of the submatrix 
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Theorem 1 . The decision problem associated with Largest bicluster(/3) is 
NP-complete. 

By the same approach, Largest bicluster(/ 2 ) can also be proved to be 
NP-complete. In the rest of this paper we will concentrate our attention on 
the problem of finding the largest-area bicluster. For practical reasons that will 
become apparent in Section 3, the objective function that we are maximizing is 

/ 3 ( X( R , f, c) = |i?| \C\ provided that |i?| > f and |Cj > c 
where r and c are two input parameters. 



3 Randomized Search 

Given that Largest bicluster(/ 3 ) problem is NP-complete, it is unlikely 
that a polynomial time algorithm could be found. In this paper, we present a 
randomized algorithm which finds a maximal solution with probability 1 — e, 
where 0 < e < 1. 

Assume that we are given a large matrix X £ jjnxm j n w hj c h a submatrix 
X(r* t c*) is implanted. Assume also that the submatrix is maximal. To 

simplify the notation, let r* = |i?*| and c* = |C*|. 

The idea behind the algorithm comes from the following simple observation. 
Observe that if we knew R* , then C* could be determined by selecting the clean 
columns with respect to R* . If instead we knew C*, then R* could be obtained 
by taking the maximal set of rows which read the same string. Unfortunately, 
neither R* nor C* is known. Our approach is to “sample” the matrix by random 
projections, with the expectation that at least some of the projections will over- 
lap with the solution ( R*,C *). Clearly, one can project either rows or columns. 
In what follows we describe how to retrieve the solution by sampling columns. 

The algorithm works as follows. Select a random subset S of size k uniformly 
from the set of columns {1, 2, . . . , to}. Assume for the time being that SnC* y^ 0. 
If we knew SnC*, then ( R *, C*) could be determined by the following three steps 

(1) select the string(s) w that appear exactly r* times in the rows of A'[ 1:lli s n <y.], 

(2) set R* to be the set of rows in which w appears and (3) set C* to be the set 
of clean columns corresponding to R* . 

The algorithm would work, but there are a few problems that are still un- 
resolved. First, the set S fl C* could be empty. The solution is to try differ- 
ent random projections S , relying on the argument that the probability that 
S'nC* y^ 0 at least once will approach one with more and more projections. The 
second problem is that we do not really know SnC*. But, certainly SnC* C S, 
so our approach is to check all possible subsets CCS such that \U\ > k m j n , 
where 1 < fc m < k is a user-defined parameter. The final problem is that we 
assumed that we knew r* , but we do not. The solution is to introduce a row 
threshold parameter, called f, that replaces r*. 

As it turns out, we need another parameter to avoid producing solutions with 
too few columns. The column threshold c is used to discard submatrices whose 
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S SnC* c 




C* 



Fig. 1. An illustration of a recovery of the embedded matrix by random projections. 
C* is the set of columns containing the embedded submatrix. S is a random selection 
of columns. By following the steps described in the text, the correct solution can be 
easily retrieved. 



number of columns is smaller than c. The algorithm considers all the submatrices 
which satisfy the user-defined row and column threshold as candidates. Among 
all candidate submatrices, only the ones that maximize the total area are kept. 
A sketch of the algorithm is shown in Figure 2. As noted in the introduction, a 
very similar strategy was developed independently and concurrently by Mislrra 
et al. [9]. 

The algorithm depends on five key parameters, namely the projection size k, 
the minimum subset size fc m ; n , the row threshold f, the column threshold c, and 
the number of iterations t. We discuss how to choose each of these in the rest of 
the section. 



Parameter Selection. The projection size k is determined by a probabilistic ar- 
gument. It is well-known that in a random string of size in over an alphabet of 
size a, the number of occurrences of substrings has two different probabilistic 
regimes (1) Gaussian distributed for strings shorter than log a m and (2) Poisson 
distributed for strings longer than log a m (see, e.g., [19]). Based on this observa- 
tion, when fc m in = k we argue that k = log a m is the optimal trade-off between 
generating too many trivial solutions (k too small) and potentially missing the 
solution (k too large). This value of k has been confirmed to be the optimal 
choice in our simulations. When fc m ; n = 1, then k can be chosen significantly 
larger, but this will adversely affect the running time. An experimental com- 
parison between k m i n = k (i.e., no subsets), and A m i n = 1 (i.e., all subsets) is 
reported in Section 5.1. 

The thresholds r and c are associated with the uncertainty on the size of the 
largest submatrix r* , c* for a particular input instance. There may be situations 
in which the user has already a good idea about r* , c*. If however r* and c* are 
completely unknown, then our target will be to find “statistically significant” 
biclusters. In Section 4 we will present a theorem (Tlrorem 2) which gives the 
expected number of columns of the largest submatrix in a random matrix, when 
the number of rows is fixed. Based on this, we propose the following trial-and- 
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Largest_Bicluster_C(X, f, k , fc m i n , f , c) 

Input: X is a n x m matrix over E 
t is the number of iterations 
k is the projection size 

fcmin is the size of the smallest subset of the projection 
r , c are the “thresholds” on the number of rows and columns, resp. 

1 repeat I, times 

2 select randomly a subset S of columns such that |Sj = k 

3 for all subsets U C S such that \U\ > k m in do 

4 D <— all strings induced by Xp :n ,v] that appear at least r times 

5 for each string w in D 

6 V <— rows corresponding to w 

7 Z <— all “clean” columns corresponding to V 

8 if \Z\ > c then save (V, Z) 

9 return the (V, Z) that maximizes / 



Fig. 2. A sketch of the algorithm that discovers large biclusters (sampling columns). 



error strategy. Set f to some value between 1 and n, and use Theorem 2 to set 
the value c. Run the algorithm. If the algorithm returns too many solutions, try 
to increase r and update c correspondingly. If there are no solutions, lower the 
value of f and repeat. Observe that the number of choices for r is finite since 
f € [l,n]. By using Theorem 2 to set the threshold c, we are trying to filter out 
submatrices whose size is small enough that they could appear in totally random 
matrices. 

Because of the randomized nature of the approach, there is no guarantee 
that the algorithm will find the solution after a given number of iterations. We 
therefore need to choose t so that the probability that the algorithm will recover 
the solution in at least one of the t trials is 1 — e, where 0 < e < 1 is a user-defined 
parameter. 

Let a(n,m,k,r*,c*,a) be the probability of missing the solution in one of 
the trials assuming that r* and c* are known and that k m i n = 1. There are two 
disjoint cases in which the algorithm can miss ( R*,C *). The first is when the 
random projection S misses completely C* , i.e., STlC* = 0. The second is when 
S fl C* = (7 / 0 but the string w chosen by the algorithm among the rows 
X[\-.n,u] also appears in another row that does not belong to the set R* of the 
real solution. In this case, the algorithm will select a set of rows larger than R* 
and thus miss the solution. Hence, we have 



k 

a(n,m,k,r*,c*,a) = Pr{S<lC* =<b} + J2 Pr {\ SnC *\=i and \R\ > r*} 

i= 1 
k 

= Pr{SnC* =0} + ^Pr{|i?| > r* given \S n C*\ = i}Pr{|S n C*\ = i} 

i = 1 
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Let Y be the random variable associated with the size of the set S n C* , 
that is, Y = |£n C*\. Since we are sampling without replacement, Y follows the 
hyper-geometric distribution. 



Pr{F = 0} = C ) / (jf) and Pr{Y = i} 




In order to compute the probability of missing the solution given 15 fl C* | = i, 
we have to estimate how likely a string w belonging to some of the rows of X\i. n m 
is more frequent than r* . Assuming the symbols in the matrix X are generated 
by a symmetric Bernoulli i.i.d. model, the probability that w will never appear 
in the other n — r* rows is (l — -A)" r and therefore 



Pr{|i?| > r* given |5nC*| 



i}= 1 - 




n—r* 



Combining all together, the probability of missing the solution in one itera- 
tion is given by 



a(n, to, k, r*,c*, a) 



(V) + £?.. (i - (i - 
(?) 






*) 



Now suppose we want the probability of missing the solution to be smaller 
than a given e, 0 < e < 1. We can obtain the number of iterations t by solving 
the inequality (a(n, m, k, r*, c*, a ))* < e, which gives 



t > 



loge 

log a(n, m, fc, r * , c*, a) 



(1) 



This bound on the number of iterations has been verified by our experimental 
results (compare Table 1 with our experimental results shown in Figure 3). For 



Table 1 . The estimated number of iterations for a matrix 256 x 256 with a submatrix 
64 x 64, for different choices of e, alphabet size a, and projection size k (sampling 
columns). 



e 


a — 2, k = 


0.005 


18794 


0.05 


10626 


0.1 


8168 


0.2 


5709 


0.3 


4271 


0.4 


3250 


0.5 


2459 


0.6 


1812 


0.7 


1265 


0.8 


792 


0.9 


374 



4, fc = 4 


a = 8, k 


1342 


306 


759 


173 


583 


133 


408 


93 


305 


70 


232 


53 


176 


40 


129 


29 


90 


21 


57 


13 


27 


6 



16, fc = 2 


a = 32, k 


179 


99 


101 


56 


78 


43 


54 


30 


41 


23 


31 


17 


23 


13 


17 


10 


12 


7 


8 


4 


4 


2 
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example, by setting a = 4, k = 4, e = 0.7, equation (1) gives t = 90 itera- 
tions whereas the experimental results show that with 90 iterations we obtain a 
performance of e = 0.689. 

The worst case time complexity of Largest_Bicluster_C is bounded by 
O Q)(kn + nm)^ . If fc m i n = 1, then the time complexity becomes 

O (t2 k (kn + nm)) . Although the complexity is exponential in k, choosing k to 
be 0(log a m) makes the algorithm run in O (trn 1//log2 a (kn + nm)) time. 

The probability of missing the solution changes significantly when we set 
fcmin = k. In this case, we are not checking any of the subsets of S, but we 
simply rely on the fact that eventually one of the random projections S will 
end up completely contained in C* , in which case we have a chance to find the 
solution. 

Since we avoid checking the 0(2 k ) subsets of S, the number of iterations t to 
achieve the same level of performance of the case fc m ; n = 1 must be significantly 
larger. Indeed, by a similar argument as we did for fc m i n = 1, the probability of 
missing the solution when fc m i n = fc can be estimated by the following formula 

a{n,m,k,r*,c*,a) = Pr{\SnC*\ < fc} + PrflSn C*\ = k and \R\ > r*} 

= 1 - Pr{|SnC*| = fc} + Pr{|S'nC , *| = fc and \R\ > r*} 




As mentioned above, we also have the option to project the rows instead of 
the columns, which would result in a slightly different algorithm that we called 
Largest_Bicluster_R. The details and the analysis of this algorithm will be 
reported in the journal version of this manuscript. 

Both strategies were implemented and tested extensively. Results are re- 
ported in Section 5. 



4 Statistical Analysis 

We now analyze the statistical significance of finding a large submatrix of size 
r x c hidden into a random n x m matrix over an alphabet of cardinality a. 
More specifically, we randomly generate a matrix X £ y« xm using a memoryless 
source with parameters {p \ , . . . , p a } where pi is the probability of the i-th symbol 
in E. Given X , the goal is to characterize asymptotically the size of the largest 
submatrix in X . 

For convenience of notation, let us call P r = p\ +P 2 + ■ ■ ■ +Pa the probability 
of observing a clean column over r rows, and let us define H(x) = — x\n.x— (1 — 
a:) ln(l — x). 




Finding Biclusters by Random Projections 111 



Table 2. The statistics of large submatrices in a random {0, l}-matrix of size 256 x 256. 
The second column reports the number of columns of the submatrices observed in a 
random matrix, whereas the third reports the prediction based on Theorem 2. 



rows 


columns observed 


columns predicted 


i 


256 


256 


2 


160 


165.6771209 


3 


100 


103.9626215 


4 


67 


67.24371945 


5 


45 


44.84053788 


6 


31 


30.70906224 


7 


23 


21.48364693 


8 


16 


15.26873716 



The first result characterizes the random variable associated with the number 
of columns of the largest bicluster, when we fix the number of rows. Both proofs 
are omitted due to lack of space. 

Theorem 2. Let C n ,m,r,a be the random variable associated with the number of 
columns of the submatrix with the largest area in a matrix X £ E nxm generated 
from a memoryless source, once the number of rows r is fixed. Then 

O n ,m,r,a ^ TYlP r ^/2.P r (l P 1 f)mF(n,r} = Cmax 

with high probability and as n — » oo, where 

F(n r) = frlogn if r = o(n) 

’ ' \ nH(a) if r = an where 0 < a < 1 



When r = o(n) the error term is 0(1/ \og d n) for some d > 1 that may 
depend on a, whereas the error becomes 0(1 /y/n) when r = an. The prediction 
on random matrices is indeed quite accurate as reported in Table 2. We claim 
that the upper bound is actually an equality, that is, asymptotically and with 
high probability C ra>mjr . i0 = C m ax . 

The practical implications of Theorem 2 are twofold. First, the expected 
number of columns can be used to set the column threshold parameter c 
maxICmax, 1}. That allows the algorithm to avoid considering statistically non- 
significant submatrices. Second, observe that when logn = o(m), then the dom- 
inant term of C max is the average, say E[C], of the number of clean columns, 
that is, E[C] = mP r . This implies C ma ,^/Pi[C] < 1 + o(l) for logn = o(m), and 
therefore with high probability any algorithm is asymptotically optimal. Clearly, 
this is not true for r = an. Finally, in passing we add that when we restrict the 
search to largest squared matrix (see objective function fi above), then its side 
is asymptotically equal to 2 log (n/ (2 logn)) / log Pf 1 . 

The second result characterizes the random variable associated with the area 
of the solution. For convenience of notation, given a memoryless source with 
parameters {pi, . . . ,p a } we define p max = max\<i< a Pi- 
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Theorem 3. Let A ntm ^ a be the random variable associated with the area of the 
largest submatrix in a matrix X £ X' rixm , m < n, generated from a memoryless 
source. Then, with high probability for any e > 0 and as n — > oo 

A n rn ^ a £ (1 d e)rc 

where r = n/2 and c = 21n2/lnp“^ x . 

The intuition behind Theorem 3 is that on random matrices one should 
expect the largest submatrix to be “skinny”, that is, a few columns and lots of 
rows, or vice versa. For example, we expect the largest submatrix in a random 
{0, l}-matrix of size 256 x 256 to be size 2 x 160 (see Table 2). 

5 Implementation and Experiments 

We implemented column- and row-sampling algorithms in C++ and tested the 
programs on a desktop PC with a 1.2GHz Athlon CPU and 1GB of RAM, under 
Linux. Although the algorithms do not require sophisticated data structures, 
in order to carry out step 4 in the algorithm of Figure 2, one needs a data 
structure to store the strings and their frequencies. Since k and a are usually 
not very large, our experience shows that a simple hash table (of size a k ) is a 
good choice. If a k becomes too large, a trie would be a better data structure. 
If one uses the hash table, it is important to keep track of the non-zero entries 
in another balanced data structure. That would avoid the algorithm to spend 
0(a k ) to search for the frequently occurring strings. Observe also that row- 
sampling algorithm does not require any hash table, or any other data structure. 
However, our experiments show that in order to get the same level of performance 
of the column sampling, the row sampling strategy needs a significantly larger 
projection k which adversely affects the running time. 

Another issue is whether one should keep track of the projections generated 
so far to avoid generating duplicates. We studied this matter experimentally, and 
found that it is worthwhile to keep track of the projections in some balanced 
data structure only when k is small. If k is large, the overhead required to keep 
the data structure updated is much higher than the time wasted in processing 
the same projection multiple times. 

5.1 Simulations 

In order to evaluate the performance of the algorithms, we designed several sim- 
ulation experiments. In these experiments we randomly generated one thousand 
256 x 256 matrices of symbols drawn from an symmetric i.ixl. distribution over 
an alphabet of cardinality a = 2,4, 8, 16, 32. Then, in each matrix we embedded 
a random 64 x 64 submatrix at random columns and random rows. We ran the 
algorithms for a few tens of iterations (f = 5, . . . , 100), and for each choice of t we 
measured the number of successes out of the 1,000 distinct instances. Figure 3 
summarizes the performance of Largest_Bicluster_C, for several choices of 
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iterations (t) 



— A — a = 32, k = 2, kmin= 1 —a — a = 32, k = 2, kmin= 2 —a — a = 16, k = 2, kmin= 1 — <= — a = 16, k = 2, kmin= 2 — ° — a = 8, k = 3, kmin= 1 
— o — a = 8, k = 3, kmin= 3 — o — a = 4, k = 4, kmin= 1 — o — a = 4, k = 4, kmin= 4 — JK — a = 2, k = 8, kmin= 1 — x — a = 2, k = 8, kmin= 8 



Fig. 3. Comparing the performance of the randomized algorithm Largest_Biclus- 
ter_C when fc m j n = k versus fc m i n = 1, for different choices of the alphabet size a. The 
projection size is k = log a to. 



alphabet size a and projection size k 7 and minimum subset size k m i n . Figure 4 
summarizes the performance of Largest_Bicluster_R under the same condi- 
tions. 

In order to make a fair comparison between fc m ; n = k and /c m ; n = 1, the 
number of iterations for the case fc m ; n = k was multiplied by 2 fc — 1. Note that 
by doing so, we are assuming that one projection for fc m ; n = 1 takes about the 
same time as one projection for fc m ; n = k, which is not necessarily very accurate. 
Under this assumption, however, k m - ln = k outperforms A; m i n = 1 (see Figure 3). 
This not necessarily true in the row sampling strategy (see Figure 4). 

By comparing the performance of row sampling against column sampling, 
one can observe that if one uses the same set of parameters, column sampling 
always outperforms row sampling. 

Unfortunately, we were unable to compare the performance of our random- 
ized approach to other biclustering algorithms (e.g. [3,4,6,5,12,8]), because 
their notion of bicluster is generally different from ours. 

6 Conclusions 

In this paper we have introduced the Largest Bicluster problem. This prob- 
lem has a variety of applications ranging from computational biology to data 
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_a = 32, k = 4, kmin = 1 ^ — a = 32, k = 4, kmin = 4 

-a = 8, k = 5, kmin = 5 — o — a = 4, k = 7, kmin = 1 



= 16, k = 4, kmin = 1 
= 4, k = 7, kmin = 7 



— a = 16, k = 4, kmin = 4 
— a = 2, k = 11, kmin = 1 



_ a = 8, k = 5, kmin = 1 
-a = 2, k= 11, kmin = 11 



Fig. 4. Comparing the performance of the randomized algorithm Largest_Biclus- 
ter_R for different choices of the alphabet size a and projection size k. 



mining. As far as we know, the pattern matching community has not looked yet 
at this problem from a combinatorial perspective. Unfortunately, the problem is 
generally NP complete. 

Here we presented a rather simple algorithm based on random projections. Its 
performance with respect to the number of projection was carefully analyzed. 
We have also presented a probabilistic analysis of the Largest Bicluster 
problem, which allows one to determine the statistical significance of a solution. 

Our approach performs remarkably well on synthetic data. On large alpha- 
bets, thirty or so iterations are enough to give a performance close to 100%. 
With respect to other biclustering algorithms (see e.g., [3, 12,8]), our algorithm 
simultaneously discovers multiple solutions which satisfy the user-defined pa- 
rameters without masking or changing the original data. In addition to this, the 
algorithm will never report solutions which are completely contained in other 
solutions. 
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Abstract. We study a problem of efficient utilisation of extra memory space in 
real-time string matching. We propose, for any constant e > 0, a real-time string 
matching algorithm claiming 0(m e ) extra space, where m is the size of a pat- 
tern. All previously known real-time string matching algorithms use l?(m) extra 
space. 



1 Introduction 

A string matching problem is one of the most studied problems in algorithmic theory 
of computing. In standard string matching problem we are interested in finding all oc- 
currences of one string P, called a pattern , in another (usually much longer) string T, 
called a text. Frequently, we denote the length of the pattern by |P| = m and the length 
of the text by |T| = n. We say that the string matching algorithm is optimal if its 
total complexity is linear in n and m. A number of optimal linear time string match- 
ing algorithms has been proposed in the literature. Some of them perform well in the 
worst case, see, e.g., [12], and some of them work fast in the average case [1, 11]. In 
the context of some particular applications, we are interested in other aspects of the 
string matching problem. E.g., in design of optimal linear time algorithms requiring 
only constant extra space, see, e.g., [4, 7-9]. Note here that most of the string matching 
algorithms use 0(rn) extra memory. In other applications we insist that the input text 
symbols are assessed consecutively, without possibility of backtracking, i.e., looking at 
symbols previously assessed. We call these on-line algorithms. A number of optimal 
(linear time) on-line algorithms have been proposed, in case where a linear extra space 
is available, see, e.g., [2,3,6]. Very recently, an optimal constant space on-line string 
matching algorithm was introduced in the context of compressed matching, see [10]. 
However, in some applications, e.g., when the text is a part of a large streamed data, 
we might be forced to assess input text symbols in real-time. I.e., the time difference 
between assessments of two consecutive text symbols must be bounded by a constant. 
This creates a need for real-time string matching algorithms. The first real-time string 
matching procedure was introduced in [5], However, as most of the algorithms it uses 
and extra linear space. In this paper we discuss more efficient utilisation of memory in 
real-time string matching. We propose here a real-time algorithm that uses only 0(m £ ) 
extra space, for any constant e > 0. 

The paper is organised as follows. In section 2 we introduce a notion of a partial next 
function and we prove a number of its useful properties. In section 3 we present a crude 
version of our real-time string matching algorithm which is followed by a discussion 
on possible improvements in section 4. 
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